refactor: rewrite blt_chol_inv_mt with row-partition threading and cache-efficient layout#126
Merged
Merged
Conversation
…che-efficient layout Replace the old model (one thread per cached column, each scanning all rows) with row-partition threading (one thread per row range, processing all cached columns simultaneously). The old model was slower than single-threaded at 4 threads. Key changes: - Row-major flat arrays [row * stride + col_slot] for tmpcol/sumcol, matching the ST layout from commit f4a828c. The inner c-loop is sequential in memory regardless of thread count. - stride = BLT_INV_CACHE_SIZE = 32, decoupled from threadcount. The inner c-loop runs 32 Fused Multiply-Add (FMAs, enabled by -mfma compiler flag from f4a828c) per (j,k) pair regardless of how many threads are used, keeping arithmetic intensity high and amortising per-k overhead. - Work-balanced static partition: prefix sum of actual per-row element counts, plus binary search for equal-work cut-points. Prevents straggler threads on non-uniform bandwidth distributions. - Per-thread spill buffer for writes to rows outside the thread's assigned range; reduced into sumcol after join. Benchmarks on combo NGA dataset (137,801 rows, 329,855,090 elements): Threads Time Speedup (old MT) 1 1005s 1.00x 4 423s 2.38x (was 1760s, 0.57x) 8 305s 3.30x (was 1182s, 0.85x) Results verified correct: max abs diff vs 1-thread baseline = 1.11022e-16. End-to-end snap timing on full NGA dataset (-t 8): Run Wall time CPU time Old MT 245m29s 1000m56s New MT 154m13s 309m21s Speedup 1.59x 3.24x less CPU The CPU time collapse (1001m => 309m) is the clearest signal: the old per-column model generated ~4x more CPU work than the new row-partition model at the same thread count due to cache thrashing. Correctness check (nga.lst diff, old vs new run): residuals and adjusted coordinates are identical. 421 of ~1.8M observation lines differ, all in the uncertainty-propagation columns (computed-error, error-of-residual) at 1e-7 to 1e-8 arcsecond scale — floating-point non-determinism from MT partial-sum reduction order.
ccrook
approved these changes
Jun 23, 2026
ccrook
left a comment
Contributor
There was a problem hiding this comment.
This one I love ... brilliant!!!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replace the old model (one thread per cached column, each scanning all rows) with row-partition threading (one thread per row range, processing all cached columns simultaneously). The old model was slower than single-threaded at 4 threads.
Key changes:
Row-major flat arrays [row * stride + col_slot] for tmpcol/sumcol, matching the ST layout from commit f4a828c. The inner c-loop is sequential in memory regardless of thread count.
stride =
BLT_INV_CACHE_SIZE = 32, decoupled from threadcount. The inner c-loop runs 32 Fused Multiply-Add (FMAs, enabled by-mfmacompiler flag from f4a828c) per (j,k) pair regardless of how many threads are used, keeping arithmetic intensity high and amortising per-k overhead.Work-balanced static partition: prefix sum of actual per-row element counts, plus binary search for equal-work cut-points. Prevents straggler threads on non-uniform bandwidth distributions.
Per-thread spill buffer for writes to rows outside the thread's assigned range; reduced into sumcol after join.
Benchmarks inversion-only operating on bandwidth-limited combo NGA dataset (137,801 rows, 329,855,090 elements):
Results verified correct: max abs diff vs 1-thread baseline = 1.11022e-16.
End-to-end
snaptiming running on full NGA dataset (-t 8):The CPU time collapse (1001m => 309m) provides perhaps the clearest signal: the old per-column model generated ~4x more CPU work than the new row-partition model at the same thread count (assumed due to cache thrashing).
Correctness check (nga.lst diff, old vs new run): residuals and adjusted coordinates are identical. 421 of ~1.8M observation lines differ, all in the uncertainty-propagation columns (computed-error, error-of-residual) at 1e-7 to 1e-8 arcsecond scale — floating-point non-determinism from MT partial-sum reduction order.
GSR-963