refactor: improve blt_chol_inv cache efficiency for dense matrices by antLI-dev · Pull Request #122 · linz/snap

antLI-dev · 2026-06-11T00:59:08Z

Transpose tmpcol/sumcol from [cache_slot][nrow] to [nrow][cache_slot] so the innermost summation loop accesses contiguous memory rather than incurring a stride of nrow*8 bytes per iteration. Also add -mfma to release CCPARAM to enable FMA3 scalar fused multiply-add instructions. Based on limited testing, combined runtime reduction is ~60% for a fully-dense matrix; the layout change is likely dominant.

gprof identified blt_chol_inv as the bottleneck for a 9,285-row fully-dense matrix (~43M elements), and manual timing added to the function confirmed that the summation loop accounted for almost all of the runtime. The old [cache_slot][nrow] layout was identified as the cause: each inner iteration stepped nrow*8 bytes between cache slots, thrashing L3 cache. AVX2 vectorisation was explored but reverted: the loop is memory-bandwidth limited, so wider SIMD loads provide no benefit.
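The stride difference between the two layouts can be illustrated with a minimal sketch. This is not the code from blt_chol_inv: the function names, the `weight` array, and the flat single-allocation storage are all hypothetical, chosen only to show how the index order changes the distance between consecutive inner-loop accesses. The element-count helper assumes that only the lower triangle of the matrix is stored, which is consistent with the ~43M figure quoted above.

```c
#include <stddef.h>

#define CACHE_SIZE 32 /* mirrors the new BLT_INV_CACHE_SIZE; name is illustrative */

/* Old layout, indexed [cache_slot][nrow]: for a fixed row, consecutive
   slots are nrow doubles apart, so each inner iteration jumps
   nrow * sizeof(double) bytes. */
static double sum_slot_major(const double *cache, size_t nrow, size_t row,
                             const double *weight)
{
    double sum = 0.0;
    for (size_t slot = 0; slot < CACHE_SIZE; slot++)
        sum += cache[slot * nrow + row] * weight[slot];
    return sum;
}

/* New layout, indexed [nrow][cache_slot]: for a fixed row, all slots sit
   next to each other, so the inner loop reads contiguous memory. */
static double sum_row_major(const double *cache, size_t row,
                            const double *weight)
{
    const double *p = cache + row * CACHE_SIZE;
    double sum = 0.0;
    for (size_t slot = 0; slot < CACHE_SIZE; slot++)
        sum += p[slot] * weight[slot];
    return sum;
}

/* Bytes skipped per inner iteration in the old layout. */
static size_t slot_stride_bytes(size_t nrow)
{
    return nrow * sizeof(double);
}

/* Element count of a fully-dense matrix, ASSUMING only the lower
   triangle (including the diagonal) is stored. */
static size_t lower_triangle_count(size_t nrow)
{
    return nrow * (nrow + 1) / 2;
}
```

With 8-byte doubles, `slot_stride_bytes(9285)` is 74,280 bytes, so in the old layout every inner iteration lands on a different cache line; `lower_triangle_count(9285)` is 43,110,255. Both functions return the same sum for the same data, so the change affects only the memory access pattern.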

Also increase BLT_INV_CACHE_SIZE from 30 to 32 (power of 2, aligns with SIMD vector widths). All 694 regression tests pass.
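The arithmetic behind the "power of 2" remark can be checked with a small sketch. The 64-byte cache line and the 4-doubles-per-vector width (256-bit AVX2) are assumptions typical of current x86-64 hardware and are not stated in the PR; the helper names are hypothetical.

```c
#include <stddef.h>

#define ASSUMED_CACHE_LINE_BYTES 64 /* assumption: typical x86-64 line size */
#define ASSUMED_VECTOR_DOUBLES 4    /* assumption: 256-bit vector of doubles */

/* Bytes occupied by one row of the [nrow][cache_slot] layout. */
static size_t row_bytes(size_t cache_slots)
{
    return cache_slots * sizeof(double);
}

/* Returns 1 if a row fills a whole number of cache lines, else 0. */
static int fills_whole_cache_lines(size_t cache_slots)
{
    return row_bytes(cache_slots) % ASSUMED_CACHE_LINE_BYTES == 0;
}

/* Returns 1 if a row splits into full vectors with no remainder, else 0. */
static int fills_whole_vectors(size_t cache_slots)
{
    return cache_slots % ASSUMED_VECTOR_DOUBLES == 0;
}
```

Under these assumptions a 32-slot row is 256 bytes, exactly four cache lines and eight full vectors, whereas a 30-slot row is 240 bytes and leaves a remainder in both cases. Rows only begin on a cache-line boundary if the allocation itself is aligned, which this sketch does not address.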

Note: bltmatrx_mt.c has its own blt_load_col_cache_mt with a separate double** structure using the old [cache_slot][nrow] layout; this commit does not currently touch the MT path.

Refer to:
GSR-954

antLI-dev · 2026-06-11T21:32:11Z

https://toitutewhenua.atlassian.net/browse/GSR-954

ccrook

Nice catch. Thanks

Transpose tmpcol/sumcol from [cache_slot][nrow] to [nrow][cache_slot] so the innermost summation loop accesses contiguous memory rather than incurring a stride of nrow*8 bytes per iteration. Also add -mfma to release CCPARAM to enable FMA3 scalar fused multiply-add instructions. Based on limited testing, combined runtime reduction is ~60% for a fully-dense matrix; the layout change is likely dominant.

gprof identified blt_chol_inv as the bottleneck for a 9,285-row fully-dense matrix (~43M elements), and manual timing added to the function confirmed the summation loop accounted for almost all the runtime. The old [cache_slot][nrow] layout was identified as the cause: each inner iteration stepped nrow*8 bytes between cache slots, thrashing L3 cache. AVX2 vectorisation was explored but reverted: the loop is memory-bandwidth limited so wider SIMD loads provide no benefit.

Also increase BLT_INV_CACHE_SIZE from 30 to 32 (power of 2, aligns with SIMD vector widths). All 694 regression tests pass.

Note: bltmatrx_mt.c has its own blt_load_col_cache_mt with a separate double** structure using the old [cache_slot][nrow] layout; this commit does not currently touch the MT path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

antLI-dev requested a review from ccrook June 11, 2026 00:59

ccrook approved these changes Jun 14, 2026


antLI-dev force-pushed the refactor/bltmatrx-cache-opt branch from c513852 to b092bd0 Compare June 15, 2026 00:53

antLI-dev merged commit f4a828c into master Jun 15, 2026
