fix: mitigate dense bltreq matrix allocation causing OOM for large networks #123

Merged
antLI-dev merged 2 commits into master from fix/snapspec-mitigate-dense-bltreq on Jun 17, 2026

Conversation

@antLI-dev

Contributor

create_bltmatrix() initialises nsparse=0. When blt_requested_size() is later called with nsparse=0, it resets every r->req to 0 as a side effect, discarding the values accumulated by the pair-loop. copy_bltmatrix() then calls copy_bltmatrix_bandwidth(), which sets nsparse=nrow, followed by alloc_bltrow_arrays(), which allocates i+1-r->req elements for row i. Because r->req=0, every row is allocated at full width. For the NZ network (122,317 stations, 137,801-row matrix) this produces a ~52 GB allocation.
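
To make the failure mode concrete, here is a minimal toy model of a size query that zeroes the per-row requests when nsparse is 0. It is only a sketch: the `ToyBlt` type and every `toy_*` function are hypothetical stand-ins, not SNAP's bltmatrix code, and it assumes a freshly created matrix starts with diagonal-only requests (req[i] = i).

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: hypothetical names, not the real bltmatrix implementation. */
typedef struct {
    int nrow;     /* number of rows in the lower triangle */
    int nsparse;  /* 0 models the state left by creation */
    int *req;     /* req[i]: first requested column of row i (0-based) */
} ToyBlt;

/* Create a matrix with nsparse = 0 and diagonal-only requests (assumed default). */
ToyBlt *toy_create(int nrow)
{
    assert(nrow > 0);
    ToyBlt *m = malloc(sizeof *m);
    if (m == NULL) return NULL;
    m->req = malloc((size_t)nrow * sizeof *m->req);
    if (m->req == NULL) { free(m); return NULL; }
    m->nrow = nrow;
    m->nsparse = 0;
    for (int i = 0; i < nrow; i++) m->req[i] = i;
    return m;
}

void toy_free(ToyBlt *m)
{
    if (m != NULL) { free(m->req); free(m); }
}

/* Mark rows as sparse; with nsparse != 0 the size query leaves req alone. */
void toy_set_sparse_rows(ToyBlt *m, int nsparse)
{
    m->nsparse = nsparse;
}

/* Request element (row, col): only ever lowers req[row], i.e. widens the row. */
void toy_request(ToyBlt *m, int row, int col)
{
    assert(row >= 0 && row < m->nrow && col >= 0 && col <= row);
    if (col < m->req[row]) m->req[row] = col;
}

/* Count requested elements; the nsparse == 0 branch is the destructive one. */
long toy_requested_size(ToyBlt *m)
{
    long total = 0;
    if (m->nsparse == 0) {
        for (int i = 0; i < m->nrow; i++) m->req[i] = 0;  /* requests lost here */
    }
    for (int i = 0; i < m->nrow; i++) total += i + 1 - m->req[i];
    return total;
}
```

In this model, a 10-row matrix with one request at (5, 3) reports 55 elements and has its req values wiped if the size is queried while nsparse is 0. The same matrix reports 12 elements, with req intact, if `toy_set_sparse_rows(m, 10)` is called first.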

Fix: call blt_set_sparse_rows(bltreq, nrow) immediately after creation. This one-line fix sets nsparse=nrow so blt_requested_size() does not fire the destructive branch and the pair-loop's req values are preserved.

The resulting row widths are still correct. The pair-loop calls blt_nonzero_element() for each station pair that needs a covariance element, which only ever decreases req (widens a row), so requests accumulate correctly into bltreq regardless of nsparse. copy_bltmatrix() in relacc_calc_requested_covar() then seeds bltreq with bltdec's bandwidth via copy_bltmatrix_bandwidth() before allocating, so the final blt is allocated at exactly UNION(pair-loop requests, bltdec bandwidth) — neither full-width nor diagonal-only.
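
The union described above can be sketched as a per-row minimum over "first stored column" values, since a smaller req means a wider row. This is an illustrative sketch under that assumption: `toy_union_bandwidth` and `toy_alloc_size` are hypothetical helpers, not functions from the SNAP source.

```c
#include <assert.h>
#include <stddef.h>

/* Merge the factor's bandwidth into the requested rows.
 * req[i] and dec_req[i] hold the first stored column of row i;
 * the union of two rows starts at the smaller of the two columns. */
void toy_union_bandwidth(int *req, const int *dec_req, int nrow)
{
    assert(req != NULL && dec_req != NULL && nrow >= 0);
    for (int i = 0; i < nrow; i++) {
        if (dec_req[i] < req[i]) req[i] = dec_req[i];
    }
}

/* Total elements allocated when row i stores columns req[i]..i. */
long toy_alloc_size(const int *req, int nrow)
{
    long total = 0;
    for (int i = 0; i < nrow; i++) {
        assert(req[i] >= 0 && req[i] <= i);
        total += i + 1 - req[i];
    }
    return total;
}
```

For a 4-row example with pair-loop requests {0, 1, 0, 3} and factor bandwidth {0, 0, 1, 2}, the merged rows are {0, 0, 0, 2}, which allocates 8 elements. That sits between the diagonal-only case (4 elements) and the full-width case (10 elements).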

A follow-up refactor (refactor/snapspec-single-cholesky-creation) will avoid holding the Cholesky factor (bltdec) in memory alongside the inverted covariance matrix (blt), reducing peak memory further.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@antLI-dev antLI-dev requested a review from ccrook June 15, 2026 01:10
The "Matrix size" log line previously showed 55 elements (100.00% full)
because blt_requested_size() had a destructive side effect: with nsparse=0
(the default from create_bltmatrix), it zeroed all req values before
counting, making every row appear full-width.

The fix in this branch (blt_set_sparse_rows(bltreq, nrow) immediately after
create_bltmatrix) sets nsparse=nrow, so blt_requested_size() no longer fires
the zeroing branch. The logged size now reflects the actual pair-loop
bandwidth: 16 elements (29.09% full) for these test cases.
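
As a quick check of the quoted figures, the percentages follow from the size of a full lower triangle. The sketch below assumes the test matrix has 10 rows, which is an inference from 55 = 10 * 11 / 2 rather than something stated in the log. The `toy_*` helpers are hypothetical and only mimic the quoted text, not the actual logging code.

```c
#include <assert.h>
#include <stdio.h>

/* Elements in a full lower triangle (diagonal included) with nrow rows. */
long toy_full_size(long nrow)
{
    assert(nrow >= 0);
    return nrow * (nrow + 1) / 2;
}

/* Write "<n> elements (<p>% full)" into buf; returns snprintf's result. */
int toy_format_fill(char *buf, size_t len, long nelement, long nrow)
{
    long full = toy_full_size(nrow);
    assert(buf != NULL && full > 0 && nelement >= 0 && nelement <= full);
    double pct = 100.0 * (double)nelement / (double)full;
    return snprintf(buf, len, "%ld elements (%.2f%% full)", nelement, pct);
}
```

With nrow = 10, passing 55 elements yields "55 elements (100.00% full)" and passing 16 yields "16 elements (29.09% full)", matching the before and after values quoted above.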

The station order assignments and all other output are unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@ccrook ccrook left a comment

Contributor

Great - thanks!

@antLI-dev antLI-dev merged commit dcdeac5 into master Jun 17, 2026