refactor: rewrite blt_chol_inv_mt with row-partition threading and cache-efficient layout #126

Merged: antLI-dev merged 1 commit into master from refactor/bltmatrx-mt-rowpartition on Jun 23, 2026

@antLI-dev antLI-dev commented Jun 23, 2026

Replace the old model (one thread per cached column, each scanning all rows) with row-partition threading (one thread per row range, processing all cached columns simultaneously). The old model was slower than single-threaded at both 4 and 8 threads (see the benchmarks below).

Key changes:

  • Row-major flat arrays [row * stride + col_slot] for tmpcol/sumcol, matching the ST layout from commit f4a828c. The inner c-loop is sequential in memory regardless of thread count.

  • stride = BLT_INV_CACHE_SIZE = 32, decoupled from threadcount. The inner c-loop runs 32 fused multiply-adds (FMAs, enabled by the -mfma compiler flag from f4a828c) per (j,k) pair regardless of how many threads are used, keeping arithmetic intensity high and amortising per-k overhead.

  • Work-balanced static partition: prefix sum of actual per-row element counts, plus binary search for equal-work cut-points. Prevents straggler threads on non-uniform bandwidth distributions.

  • Per-thread spill buffer for writes to rows outside the thread's assigned range; reduced into sumcol after join.
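
The work-balanced partition described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `build_prefix`, `lower_bound`, `partition_rows`, `rowcount`, `prefix` and `cut` are hypothetical, and it assumes the per-row element counts are already available as an array.

```c
#include <stddef.h>

/* Hypothetical helper: prefix[r] = number of stored elements in rows 0..r-1.
   rowcount[] is assumed to hold the per-row element counts; prefix[] must
   have room for nrows + 1 entries. */
void build_prefix(const size_t *rowcount, size_t nrows, size_t *prefix)
{
    prefix[0] = 0;
    for (size_t r = 0; r < nrows; r++)
        prefix[r + 1] = prefix[r] + rowcount[r];
}

/* Binary search: smallest r in [0, nrows] with prefix[r] >= target. */
size_t lower_bound(const size_t *prefix, size_t nrows, size_t target)
{
    size_t lo = 0, hi = nrows;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (prefix[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Fill cut[0..nthreads] so that thread t owns the half-open row range
   [cut[t], cut[t+1]) and each range holds roughly total/nthreads elements. */
void partition_rows(const size_t *prefix, size_t nrows,
                    size_t nthreads, size_t *cut)
{
    size_t total = prefix[nrows];
    size_t q = total / nthreads, rem = total % nthreads;

    cut[0] = 0;
    for (size_t t = 1; t < nthreads; t++) {
        /* floor(total * t / nthreads), written to avoid overflowing total * t */
        size_t target = q * t + rem * t / nthreads;
        cut[t] = lower_bound(prefix, nrows, target);
    }
    cut[nthreads] = nrows;
}
```

For example, with per-row counts 1, 2, ..., 8 and four threads, this sketch yields cut points 0, 4, 6, 7, 8, giving per-thread work of 10, 11, 7 and 8 elements, compared with 3, 7, 11 and 15 for an equal-row split.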

Inversion-only benchmarks on the bandwidth-limited combo NGA dataset (137,801 rows, 329,855,090 elements):

  Threads   Time    Speedup   (old MT)
  1         1005s   1.00x
  4          423s   2.38x     (was 1760s, 0.57x)
  8          305s   3.30x     (was 1182s, 0.85x)

Results verified correct: max abs diff vs 1-thread baseline = 1.11022e-16.
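
A check of this kind could look like the sketch below. It is an assumption about how such a comparison might be done, not the verification code used here; `max_abs_diff` is a hypothetical name, and the two arrays stand for the multi-threaded result and the 1-thread baseline. For reference, 1.11022e-16 is 2^-53, half of `DBL_EPSILON`, which is the spacing of doubles in [0.5, 1).

```c
#include <stddef.h>

/* Hypothetical helper: largest absolute element-wise difference between
   two result arrays of equal length n (0.0 when n == 0). */
double max_abs_diff(const double *a, const double *b, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        if (d < 0.0)
            d = -d;          /* absolute value without needing libm */
        if (d > worst)
            worst = d;
    }
    return worst;
}
```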

End-to-end snap timing on the full NGA dataset (-t 8):

  Run       Wall time   CPU time
  Old MT    245m29s     1000m56s
  New MT    154m13s      309m21s
  Speedup   1.59x        3.24x less CPU

The CPU time collapse (1001m => 309m) is perhaps the clearest signal: the old per-column model generated ~3.2x the CPU work of the new row-partition model at the same thread count (presumably due to cache thrashing).
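
The ratios in the end-to-end table can be reproduced from the raw times with a few lines of arithmetic. The helpers below are hypothetical (`to_seconds`, `ratio`) and only restate the table; the last two figures, CPU time divided by wall time, are a rough measure of average busy cores and assume the CPU column is total time summed over all threads.

```c
/* Hypothetical helpers for reproducing the table arithmetic. */
double to_seconds(int minutes, int seconds)
{
    return minutes * 60.0 + seconds;
}

double ratio(double numerator, double denominator)
{
    return numerator / denominator;
}

/* Values taken from the end-to-end table (-t 8):
     old wall = to_seconds(245, 29)  = 14729 s
     new wall = to_seconds(154, 13)  =  9253 s
     old CPU  = to_seconds(1000, 56) = 60056 s
     new CPU  = to_seconds(309, 21)  = 18561 s

   ratio(old wall, new wall) is about 1.59  (wall-time speedup)
   ratio(old CPU,  new CPU)  is about 3.24  (CPU-time reduction)
   ratio(old CPU,  old wall) is about 4.08  (average busy cores, old MT)
   ratio(new CPU,  new wall) is about 2.01  (average busy cores, new MT) */
```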

Correctness check (nga.lst diff, old vs new run): residuals and adjusted coordinates are identical. 421 of ~1.8M observation lines differ, all in the uncertainty-propagation columns (computed-error, error-of-residual) at the 1e-7 to 1e-8 arcsecond scale. This is consistent with floating-point non-determinism from the order in which MT partial sums are reduced.
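
Why reduction order matters can be shown with a deliberately exaggerated example. The sketch below is not the reduction used in this PR; `chunked_sum` is a hypothetical function that sums an array in contiguous chunks (one per notional thread) and then adds the partial sums. It assumes IEEE 754 double arithmetic without extended precision or `-ffast-math`.

```c
#include <stddef.h>

/* Hypothetical illustration: sum x[0..n-1] as nchunks contiguous partial
   sums, then add the partials in chunk order. Floating-point addition is
   not associative, so the result can depend on nchunks. */
double chunked_sum(const double *x, size_t n, size_t nchunks)
{
    double total = 0.0;
    for (size_t c = 0; c < nchunks; c++) {
        size_t begin = n * c / nchunks;
        size_t end = n * (c + 1) / nchunks;
        double partial = 0.0;
        for (size_t i = begin; i < end; i++)
            partial += x[i];
        total += partial;
    }
    return total;
}

/* With x = {1e16, 1.0, -1e16, 1.0} the exact sum is 2.0, but:
     chunked_sum(x, 4, 1) gives 1.0  (the first 1.0 is absorbed by 1e16)
     chunked_sum(x, 4, 2) gives 0.0  (both 1.0 terms are absorbed) */
```

The magnitudes here are chosen to make the effect obvious; with well-scaled data the differences are typically confined to the last few bits.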

GSR-963

@antLI-dev antLI-dev requested a review from ccrook June 23, 2026 20:58

@ccrook ccrook left a comment

This one I love ... brilliant!!!

@antLI-dev antLI-dev merged commit f44231f into master Jun 23, 2026
1 of 2 checks passed