Optimize CUDA kernels #3600
Open
guoqingbao wants to merge 3 commits into
Summary
This PR adds another round of CUDA kernel optimizations for binary ops, cast ops, reductions, and selected unary ops.
The main focus is improving hot contiguous BF16/F32 paths through specialized vectorized kernel implementations, reduced indexing overhead, and more efficient CUDA primitives where appropriate.
Combined, these CUDA kernel optimizations deliver up to 15% end-to-end speedup in various models.
Changes
Binary Ops
Updated `binary.cu` and `binary_op_macros.cuh`.
- `badd_bf16`, `bmul_bf16`:
  - `float4` loads, processing 8 BF16 elements per load.
  - `__hadd2` and `__hmul2` intrinsics.
- `badd_f32`, `bdiv_f32`, `bmul_f32`:
  - `float4` loads, processing 4 F32 elements per load.
- `BINARY_OP_BF16_VEC` in `binary_op_macros.cuh` for reusable vectorized BF16 binary ops with float-promoted per-element computation.
Cast Ops
Updated `cast.cu`.
- `cast_bf16_f32` with vectorized contiguous loads:
  - `float4` loads, processing 8 BF16 values per load.
  - `__bfloat162float` for conversion.
- `cast_f32_bf16` with vectorized contiguous loads:
  - `float4` loads, processing 4 F32 values per load.
  - `__float2bfloat16_rn` for conversion.
Reduce Ops
Updated `reduce.cu` and `mod.rs`.
- `fast_sum` template:
  - Warp-level reduction with `__shfl_xor_sync`.
  - No `get_strided_index` on contiguous paths.
  - Accumulation in `float` for better precision with half-precision input types.
- `fast_sum_bf16_vec`, a specialized BF16 vectorized reduce kernel using:
  - `float4` loads
- `fast_sum_f32` path using:
  - `float4` loads
- `FAST_OP` macro path for F32 sum.
- `fast_sum_small` kernels for small reductions where `el_to_sum <= 32`.
- `mod.rs` with `use_small_reduce` logic.
- Reductions with `el_to_sum <= 32` now dispatch to `fast_sum_small` using 256-thread blocks instead of launching one block per output element.
Unary Ops
Updated `unary.cu`.
- `ucopy_bf16` with vectorized contiguous copies using `float4` loads.
- `usilu_bf16` with vectorized contiguous processing: `x / (1 + exp(-x))`
- `usigmoid_bf16` with vectorized contiguous processing: `1 / (1 + exp(-x))`
Expected Impact
These changes should improve throughput for common CUDA workloads, especially contiguous BF16 and F32 operations.
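To make the contiguous BF16 fast path concrete, here is a minimal sketch of a vectorized add that reads 8 BF16 values per `float4` load and uses `__hadd2` on packed pairs. It is not the PR's `badd_bf16` code: the names `add_bf16_vec8` and `check_add_bf16_vec8` are hypothetical, and the sketch assumes contiguous, 16-byte-aligned buffers and a toolkit/GPU where BF16 arithmetic intrinsics are available (for example, compiling with `-arch=sm_80`).

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel, not the PR's badd_bf16. Assumes contiguous inputs whose
// base pointers are 16-byte aligned (cudaMalloc allocations satisfy this).
__global__ void add_bf16_vec8(const __nv_bfloat16* a, const __nv_bfloat16* b,
                              __nv_bfloat16* out, size_t n) {
    const size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    const size_t n_vec = n / 8;  // number of full 16-byte chunks (8 BF16 each)

    const float4* a4 = reinterpret_cast<const float4*>(a);
    const float4* b4 = reinterpret_cast<const float4*>(b);
    float4* o4 = reinterpret_cast<float4*>(out);

    for (size_t i = tid; i < n_vec; i += stride) {
        float4 va = a4[i];  // one 128-bit load = 8 BF16 values
        float4 vb = b4[i];
        float4 vo;
        const __nv_bfloat162* pa = reinterpret_cast<const __nv_bfloat162*>(&va);
        const __nv_bfloat162* pb = reinterpret_cast<const __nv_bfloat162*>(&vb);
        __nv_bfloat162* po = reinterpret_cast<__nv_bfloat162*>(&vo);
#pragma unroll
        for (int k = 0; k < 4; ++k) {
            po[k] = __hadd2(pa[k], pb[k]);  // two BF16 adds per intrinsic
        }
        o4[i] = vo;  // one 128-bit store
    }
    // Scalar tail for the last n % 8 elements.
    for (size_t i = n_vec * 8 + tid; i < n; i += stride) {
        out[i] = __hadd(a[i], b[i]);
    }
}

// Host-side check against a float reference; returns true on an exact match.
bool check_add_bf16_vec8(size_t n) {
    std::vector<__nv_bfloat16> ha(n), hb(n), ho(n);
    for (size_t i = 0; i < n; ++i) {
        ha[i] = __float2bfloat16((float)(i % 7));  // small values, exact in BF16
        hb[i] = __float2bfloat16(0.5f);
    }
    const size_t bytes = n * sizeof(__nv_bfloat16);
    __nv_bfloat16 *da = nullptr, *db = nullptr, *dout = nullptr;
    if (cudaMalloc(&da, bytes) != cudaSuccess ||
        cudaMalloc(&db, bytes) != cudaSuccess ||
        cudaMalloc(&dout, bytes) != cudaSuccess) {
        cudaFree(da); cudaFree(db); cudaFree(dout);
        return false;
    }
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    add_bf16_vec8<<<64, 256>>>(da, db, dout, n);

    // The blocking copy also waits for the kernel to finish.
    bool ok = cudaMemcpy(ho.data(), dout, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (size_t i = 0; ok && i < n; ++i) {
        ok = __bfloat162float(ho[i]) == (float)(i % 7) + 0.5f;
    }
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return ok;
}
```

Calling `check_add_bf16_vec8(1003)` exercises both the vectorized body and the scalar tail, since 1003 is not a multiple of 8. A production kernel would also need a fallback for non-contiguous or misaligned views, which this sketch leaves out.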
The largest expected gains are in:
- small reductions where `el_to_sum <= 32`
The optimizations reduce scalar memory access, avoid unnecessary stride indexing on contiguous paths, reduce shared-memory synchronization overhead in reductions, and improve utilization through vectorized loads.
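The reduction ideas above can be sketched as follows. This is an illustration under stated assumptions, not the PR's `fast_sum` or `fast_sum_small` code: the kernel names `row_sum_warp` and `row_sum_small`, the helper `check_row_sum`, the one-warp-per-row launch shape, and the row-major contiguous layout are all hypothetical. It shows a butterfly reduction with `__shfl_xor_sync` that needs no shared memory, `float` accumulation for BF16 input, and a dispatch that sends rows of 32 or fewer elements to a one-thread-per-output kernel with 256-thread blocks.

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>
#include <vector>

// One warp per output row (launch with exactly 32 threads per block).
// Each lane accumulates a strided slice in float, then lanes are combined
// with a butterfly shuffle, so no shared memory or __syncthreads is needed.
__global__ void row_sum_warp(const __nv_bfloat16* in, float* out, int cols) {
    const int row = blockIdx.x;
    const int lane = threadIdx.x;
    float acc = 0.0f;
    for (int c = lane; c < cols; c += 32) {
        acc += __bfloat162float(in[(size_t)row * cols + c]);
    }
    for (int offset = 16; offset > 0; offset >>= 1) {
        acc += __shfl_xor_sync(0xffffffffu, acc, offset);
    }
    if (lane == 0) out[row] = acc;  // every lane holds the total; lane 0 writes
}

// Small-reduction variant: one thread per output row, serial sum in float.
__global__ void row_sum_small(const __nv_bfloat16* in, float* out,
                              int rows, int cols) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c) {
        acc += __bfloat162float(in[(size_t)row * cols + c]);
    }
    out[row] = acc;
}

// Host-side dispatch and check; returns true if every row sum is exact.
bool check_row_sum(int rows, int cols) {
    const size_t n = (size_t)rows * cols;
    std::vector<__nv_bfloat16> h_in(n);
    std::vector<float> h_out(rows), ref(rows, 0.0f);
    for (size_t i = 0; i < n; ++i) {
        const float v = (float)(i % 4);  // small integers: exact in BF16 and float
        h_in[i] = __float2bfloat16(v);
        ref[i / cols] += v;
    }
    __nv_bfloat16* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(__nv_bfloat16)) != cudaSuccess ||
        cudaMalloc(&d_out, rows * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_out);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(__nv_bfloat16), cudaMemcpyHostToDevice);

    if (cols <= 32) {
        // Many outputs per block instead of one block per output element.
        row_sum_small<<<(rows + 255) / 256, 256>>>(d_in, d_out, rows, cols);
    } else {
        row_sum_warp<<<rows, 32>>>(d_in, d_out, cols);
    }

    bool ok = cudaMemcpy(h_out.data(), d_out, rows * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int r = 0; ok && r < rows; ++r) ok = (h_out[r] == ref[r]);
    cudaFree(d_in); cudaFree(d_out);
    return ok;
}
```

For example, `check_row_sum(5, 16)` takes the small path and `check_row_sum(5, 1000)` takes the warp path. The test data uses small integers so that the float sums are exact regardless of summation order; with general inputs the two paths can differ by rounding.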