refactor: improve blt_chol_inv cache efficiency for dense matrices#122
Merged
Conversation
Contributor
Author
Transpose tmpcol/sumcol from [cache_slot][nrow] to [nrow][cache_slot] so the innermost summation loop accesses contiguous memory rather than incurring a stride of nrow*8 bytes per iteration. Also add -mfma to release CCPARAM to enable FMA3 scalar fused multiply-add instructions. Based on limited testing, combined runtime reduction is ~60% for a fully-dense matrix; the layout change is likely dominant. gprof identified blt_chol_inv as the bottleneck for a 9,285-row fully-dense matrix (~43M elements), and manual timing added to the function confirmed the summation loop accounted for almost all the runtime. The old [cache_slot][nrow] layout was identified as the cause: each inner iteration stepped nrow*8 bytes between cache slots, thrashing L3 cache. AVX2 vectorisation was explored but reverted — the loop is memory-bandwidth limited so wider SIMD loads provide no benefit. Also increase BLT_INV_CACHE_SIZE from 30 to 32 (power of 2, aligns with SIMD vector widths). All 694 regression tests pass. Note: bltmatrx_mt.c has its own blt_load_col_cache_mt with a separate double** structure using the old [cache_slot][nrow] layout; this commit does not currently touch the MT path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
c513852 to
b092bd0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Transpose tmpcol/sumcol from [cache_slot][nrow] to [nrow][cache_slot] so the innermost summation loop accesses contiguous memory rather than incurring a stride of nrow*8 bytes per iteration. Also add -mfma to release CCPARAM to enable FMA3 scalar fused multiply-add instructions. Based on limited testing, combined runtime reduction is ~60% for a fully-dense matrix; the layout change is likely dominant.
gprof identified blt_chol_inv as the bottleneck for a 9,285-row fully-dense matrix (~43M elements), and manual timing added to the function confirmed the summation loop accounted for almost all the runtime. The old [cache_slot][nrow] layout was identified as the cause: each inner iteration stepped nrow*8 bytes between cache slots, thrashing L3 cache. AVX2 vectorisation was explored but reverted — the loop is memory-bandwidth limited so wider SIMD loads provide no benefit.
Also increase BLT_INV_CACHE_SIZE from 30 to 32 (power of 2, aligns with SIMD vector widths). All 694 regression tests pass.
Note: bltmatrx_mt.c has its own blt_load_col_cache_mt with a separate double** structure using the old [cache_slot][nrow] layout; this commit does not currently touch the MT path.
Refer to:
GSR-954