refactor: improve blt_chol_inv cache efficiency for dense matrices #122

Merged: antLI-dev merged 1 commit into master from refactor/bltmatrx-cache-opt on Jun 15, 2026
antLI-dev commented Jun 11, 2026

Transpose tmpcol/sumcol from [cache_slot][nrow] to [nrow][cache_slot] so the innermost summation loop accesses contiguous memory rather than incurring a stride of nrow*8 bytes per iteration. Also add -mfma to release CCPARAM to enable FMA3 scalar fused multiply-add instructions. Based on limited testing, combined runtime reduction is ~60% for a fully-dense matrix; the layout change is likely dominant.
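The effect of the transposition can be sketched as follows. This is a minimal illustration, not the code from bltmatrx.c: the flat arrays, the `weight` vector, and the function names are hypothetical, and only the two index orders are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BLT_INV_CACHE_SIZE (32 after this PR). */
#define CACHE_SLOTS 32

/* Old layout, [cache_slot][nrow]: element (slot, row) sits at
 * cache[slot * nrow + row], so stepping to the next slot jumps
 * nrow doubles (nrow * 8 bytes) through memory. */
double sum_slot_major(const double *cache, const double *weight,
                      size_t nrow, size_t row, size_t nslots)
{
    double sum = 0.0;
    assert(row < nrow && nslots <= CACHE_SLOTS);
    for (size_t s = 0; s < nslots; s++)
        sum += cache[s * nrow + row] * weight[s];
    return sum;
}

/* New layout, [nrow][cache_slot]: element (row, slot) sits at
 * cache[row * CACHE_SLOTS + slot], so the same loop walks
 * adjacent doubles. */
double sum_row_major(const double *cache, const double *weight,
                     size_t row, size_t nslots)
{
    double sum = 0.0;
    assert(nslots <= CACHE_SLOTS);
    for (size_t s = 0; s < nslots; s++)
        sum += cache[row * CACHE_SLOTS + s] * weight[s];
    return sum;
}
```

Both functions compute the same sum over the slots; only the memory access pattern differs. The multiply-accumulate in each loop body is the kind of expression a compiler may contract into a fused multiply-add when built with -mfma.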

gprof identified blt_chol_inv as the bottleneck for a 9,285-row fully-dense matrix (~43M elements), and manual timing added to the function confirmed the summation loop accounted for almost all the runtime. The old [cache_slot][nrow] layout was identified as the cause: each inner iteration stepped nrow*8 bytes between cache slots, thrashing L3 cache. AVX2 vectorisation was explored but reverted — the loop is memory-bandwidth limited so wider SIMD loads provide no benefit.
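The figures quoted above can be checked with a little arithmetic. The helpers below are hypothetical and assume 8-byte doubles; reading "~43M elements" as the lower triangle (including the diagonal) of the symmetric matrix is also an assumption, though it agrees with the stated row count.

```c
#include <assert.h>
#include <stddef.h>

/* Distance in bytes between the same row in adjacent cache slots
 * under the old [cache_slot][nrow] layout. */
size_t old_layout_stride_bytes(size_t nrow)
{
    assert(nrow > 0);
    return nrow * sizeof(double);
}

/* Number of stored elements in the lower triangle, diagonal
 * included, of an nrow x nrow symmetric matrix. */
size_t lower_triangle_elements(size_t nrow)
{
    return nrow * (nrow + 1) / 2;
}
```

For nrow = 9285 the stride is 74,280 bytes, far more than a typical 64-byte cache line, so each inner iteration under the old layout touches a different line. The lower triangle holds 43,110,255 elements.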

Also increase BLT_INV_CACHE_SIZE from 30 to 32 (power of 2, aligns with SIMD vector widths). All 694 regression tests pass.
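The alignment argument for 32 over 30 can be made concrete with a small sketch. The helper names and the 64-byte cache-line size are assumptions (typical for x86-64), and rows only start on a line boundary if the base allocation is itself aligned.

```c
#include <assert.h>
#include <stddef.h>

#define AVX2_VECTOR_BYTES 32 /* one 256-bit register holds 4 doubles */
#define CACHE_LINE_BYTES 64  /* assumption: typical x86-64 line size */

/* Bytes occupied by one row of the [nrow][cache_slot] layout. */
size_t row_bytes(size_t slots)
{
    assert(slots > 0);
    return slots * sizeof(double);
}

/* Nonzero if a row divides evenly into 256-bit vectors. */
int row_is_whole_vectors(size_t slots)
{
    return row_bytes(slots) % AVX2_VECTOR_BYTES == 0;
}

/* Nonzero if a row divides evenly into cache lines. */
int row_is_whole_cache_lines(size_t slots)
{
    return row_bytes(slots) % CACHE_LINE_BYTES == 0;
}
```

With 32 slots a row is 256 bytes: exactly four cache lines or eight 256-bit vectors. With 30 slots a row is 240 bytes, which is a whole number of neither.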

Note: bltmatrx_mt.c has its own blt_load_col_cache_mt with a separate double** structure using the old [cache_slot][nrow] layout; this commit does not currently touch the MT path.

Refer to:
GSR-954

antLI-dev requested a review from ccrook on June 11, 2026 00:59

ccrook left a comment

Nice catch. Thanks

Transpose tmpcol/sumcol from [cache_slot][nrow] to [nrow][cache_slot]
so the innermost summation loop accesses contiguous memory rather than
incurring a stride of nrow*8 bytes per iteration.  Also add -mfma to
release CCPARAM to enable FMA3 scalar fused multiply-add instructions.
Based on limited testing, combined runtime reduction is ~60% for a
fully-dense matrix; the layout change is likely dominant.

gprof identified blt_chol_inv as the bottleneck for a 9,285-row
fully-dense matrix (~43M elements), and manual timing added to the
function confirmed the summation loop accounted for almost all the
runtime.  The old [cache_slot][nrow] layout was identified as the cause:
each inner iteration stepped nrow*8 bytes between cache slots, thrashing
L3 cache.  AVX2 vectorisation was explored but reverted — the loop is
memory-bandwidth limited so wider SIMD loads provide no benefit.

Also increase BLT_INV_CACHE_SIZE from 30 to 32 (power of 2, aligns
with SIMD vector widths).  All 694 regression tests pass.

Note: bltmatrx_mt.c has its own blt_load_col_cache_mt with a separate
double** structure using the old [cache_slot][nrow] layout; this commit
does not currently touch the MT path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
antLI-dev force-pushed the refactor/bltmatrx-cache-opt branch from c513852 to b092bd0 on June 15, 2026 00:53
antLI-dev merged commit f4a828c into master on Jun 15, 2026