refactor: rewrite blt_chol_inv_mt with row-partition threading and cache-efficient layout by antLI-dev · Pull Request #126 · linz/snap

antLI-dev · 2026-06-23T20:58:14Z

Replace the old model (one thread per cached column, each scanning all rows) with row-partition threading (one thread per row range, processing all cached columns simultaneously). The old model was slower than single-threaded at both 4 and 8 threads (0.57x and 0.85x in the benchmark below).
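As a rough illustration of the row-partition model, the sketch below has one worker own a contiguous row range and update every cached column slot of each row in that range. The function name, the `coef` and `src` arrays, and the exact update are assumptions made for illustration, not the code in this PR; only the `[row * stride + col_slot]` indexing and the stride of 32 come from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Stride of 32 as stated in this PR (BLT_INV_CACHE_SIZE); redefined here
   under a hypothetical name so the sketch is self-contained. */
#define SKETCH_STRIDE 32

/* Hypothetical worker body: for each row j in [row_begin, row_end), add
   coef[j] * src[c] into all SKETCH_STRIDE column slots of that row.
   sumcol is row-major: element (row, c) lives at row * SKETCH_STRIDE + c. */
static void update_row_range(double *sumcol, const double *coef,
                             const double *src,
                             size_t row_begin, size_t row_end)
{
    assert(row_begin <= row_end);
    for (size_t j = row_begin; j < row_end; j++) {
        double *dst = &sumcol[j * SKETCH_STRIDE];
        /* Inner c-loop: consecutive doubles in both dst and src. */
        for (size_t c = 0; c < SKETCH_STRIDE; c++)
            dst[c] += coef[j] * src[c];
    }
}
```

In this sketch each worker writes only rows inside its own range, so two workers with disjoint ranges never write the same element.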

Key changes:

- Row-major flat arrays [row * stride + col_slot] for tmpcol/sumcol, matching the ST layout from commit f4a828c. The inner c-loop is sequential in memory regardless of thread count.
- stride = BLT_INV_CACHE_SIZE = 32, decoupled from threadcount. The inner c-loop runs 32 fused multiply-adds (FMAs, enabled by the -mfma compiler flag from f4a828c) per (j,k) pair regardless of how many threads are used, keeping arithmetic intensity high and amortising per-k overhead.
- Work-balanced static partition: prefix sum of actual per-row element counts, plus binary search for equal-work cut-points. Prevents straggler threads on non-uniform bandwidth distributions.
- Per-thread spill buffer for writes to rows outside the thread's assigned range; reduced into sumcol after join.
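The work-balanced partition can be sketched as follows. This is a minimal illustration assuming a hypothetical `row_count` array holding the number of stored elements per row; the function name, parameter names, and the choice of cut rule (smallest row index whose cumulative count reaches the target) are assumptions, not taken from the PR's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill cuts[0..nthreads] so that thread t owns rows [cuts[t], cuts[t+1]).
   row_count[i] is the element count of row i (hypothetical input).
   prefix must have room for nrows + 1 entries. */
static void balanced_partition(const uint64_t *row_count, size_t nrows,
                               size_t nthreads, uint64_t *prefix,
                               size_t *cuts)
{
    assert(nthreads > 0);

    /* Prefix sum: prefix[r] = total elements in rows [0, r). */
    prefix[0] = 0;
    for (size_t i = 0; i < nrows; i++)
        prefix[i + 1] = prefix[i] + row_count[i];
    uint64_t total = prefix[nrows];

    cuts[0] = 0;
    for (size_t t = 1; t < nthreads; t++) {
        /* Cumulative work the first t threads should cover. */
        uint64_t target = total * t / nthreads;

        /* Binary search for the smallest r with prefix[r] >= target.
           Starting at the previous cut keeps the cuts non-decreasing. */
        size_t lo = cuts[t - 1], hi = nrows;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (prefix[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        cuts[t] = lo;
    }
    cuts[nthreads] = nrows;
}
```

With rows of growing length, as in a banded triangular factor, an equal-row split would give the last thread far more elements than the first; cutting on the prefix sum instead gives each thread a roughly equal share of elements, at the cost of unequal row counts.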

Inversion-only benchmarks on the bandwidth-limited combo NGA dataset (137,801 rows, 329,855,090 elements):

  Threads   Time    Speedup   (old MT)
  1         1005s   1.00x
  4          423s   2.38x     (was 1760s, 0.57x)
  8          305s   3.30x     (was 1182s, 0.85x)

Results verified correct: max abs diff vs 1-thread baseline = 1.11022e-16.
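A check of this kind can be sketched as below, assuming the two results are available as flat arrays of equal length; the function name and signature are hypothetical. For reference, 1.11022e-16 matches 2^-53, which is half of DBL_EPSILON and the spacing between adjacent doubles in [0.5, 1).

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute element-wise difference between two arrays of length n.
   Hypothetical helper; NaN values are not treated specially. */
static double max_abs_diff(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```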

End-to-end snap timing on the full NGA dataset (-t 8):

  Run       Wall time   CPU time
  Old MT    245m29s     1000m56s
  New MT    154m13s      309m21s
  Speedup   1.59x        3.24x less CPU

The collapse in CPU time (1001m => 309m) is perhaps the clearest signal: at the same thread count, the old per-column model consumed ~3.2x the CPU time of the new row-partition model (assumed to be due to cache thrashing).

Correctness check (nga.lst diff, old vs new run): residuals and adjusted coordinates are identical. 421 of ~1.8M observation lines differ, all in the uncertainty-propagation columns (computed-error, error-of-residual) at 1e-7 to 1e-8 arcsecond scale. These differences are attributed to floating-point non-determinism from the MT partial-sum reduction order.
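The effect of reduction order can be reproduced in isolation. The sketch below is not the PR's reduction code; it is a hypothetical helper that sums an array as `nchunks` partial sums and then adds the partials, standing in for per-thread accumulation followed by a reduce. Floating-point addition is not associative, so different chunk counts can give results that differ in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Sum values[0..n) as nchunks contiguous partial sums, then add the
   partials in chunk order. nchunks == 1 is a plain left-to-right sum. */
static double sum_chunked(const double *values, size_t n, size_t nchunks)
{
    assert(nchunks > 0);
    double total = 0.0;
    for (size_t t = 0; t < nchunks; t++) {
        size_t begin = t * n / nchunks;
        size_t end = (t + 1) * n / nchunks;
        double partial = 0.0;
        for (size_t i = begin; i < end; i++)
            partial += values[i];
        total += partial;
    }
    return total;
}
```

For the three values 0.1, 0.2, 0.3, one chunk computes (0.1 + 0.2) + 0.3 while two chunks compute 0.1 + (0.2 + 0.3); under IEEE 754 double arithmetic these two results are not equal, even though both are correctly rounded at every step.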

GSR-963


ccrook

This one I love ... brilliant!!!

antLI-dev requested a review from ccrook June 23, 2026 20:58

ccrook approved these changes Jun 23, 2026

antLI-dev merged commit f44231f into master Jun 23, 2026
1 of 2 checks passed

antLI-dev merged 1 commit into master from refactor/bltmatrx-mt-rowpartition
