Optimize compactSelection: 25-50% wall-time reduction in GLIMPSE2_phase, byte-identical results#291
Optimize compactSelection: 25-50% wall-time reduction in GLIMPSE2_phase, byte-identical results#291tfenne wants to merge 3 commits into
Conversation
Profiling GLIMPSE2_phase on a single 1.9x-coverage diploid sample with
~1984 reference haplotypes (--Kpbwt 2000, so the iota selection path is
taken every burn/main iteration) showed conditioning_set::compactSelection
consuming ~35% of wall time on x86 -- more than any individual HMM kernel.
The dominant cost was the inner loop that builds the Hvar bitmatrix:
for (int k = 0 ; k < idxHaps_ref.size() ; k++)
Hvar.set(lrel, k, H.HvarRef.get(lcom, idxHaps_ref[k]));
Each set() does a read-modify-write on a single byte -- 8 RMW byte ops per
output byte, ~860M byte ops per call.
Packing 8 selected reference bits into one byte and storing it directly via
a new bitmatrix::setByte helper turns those 8 RMW ops into 8 reads and 1
store. The output byte content is identical, so the final VCF/BCF is
bit-exact against the previous code.
Trailing bits when n_states is not a multiple of 8 still go through the
original set() path so any pre-existing content in the unused tail of the
last byte of each row is left untouched.
Measured wall-time reductions on a single chrA1 chunk, end-to-end:
* x86 (Intel Xeon Granite Rapids, native AVX2): 129s -> 96s (~26%)
* arm64 (Apple Silicon, AVX2 -> NEON via simde): 140s -> 79s (~44%)
The arm64 win is larger because simde turns each 256-bit AVX2 op into two
128-bit NEON ops, which inflates the relative weight of the (scalar)
compactSelection bit-pack work versus the SIMD HMM kernels.
Validated byte-identical output (bcftools view -H | md5sum) across 5
chunks (chrA1/A2/B2/C1/D1) on both architectures.
When --Kpbwt is >= the number of reference haplotypes, compactSelection takes the iota branch and idxHaps_ref ends up holding every reference haplotype index (0..n_ref_haps-1). In that case the derived state -- var_type, polymorphic_sites, monomorphic_sites, Svar, and the Hvar bitmatrix -- depends only on the reference panel, not on the individual or the iteration. The same is true of the transition vectors t and nt. GLIMPSE2_phase still rebuilds all of that work on every iteration (~21 calls per chunk for typical 5+15 burn/main settings). After the prior commit reduced compactSelection's per-call cost, the function still sat at ~12% of x86 wall and was still doing that fully-redundant work. Cache it: track the n_ref_haps value the full-panel state was last built for, and short-circuit compactSelection when the next call would produce the same selection. updateTransitions short-circuits via a parallel flag that's cleared whenever compactSelection rebuilds. The cache is always invalidated when the use_list path runs (init_states, PBWT-selected subsets, or any future selection mode), so it only short- circuits when the next call's idxHaps_ref would be byte-identical to the last one. Output is byte-exact against the previous code. The cache assumes H is immutable for the lifetime of this conditioning_set, which holds today (one conditioning_set per worker thread, allocated for the duration of a chunk and never reseated to a different panel). Wall-time on the same 5 cat-genome chunks, single 1.9x diploid, --Kpbwt 2000: * x86 (Intel Xeon Granite Rapids): 96s -> 84s (additional ~12% over prior commit; ~35% vs baseline) * arm64 (Apple Silicon, simde): 79s -> 66s (additional ~16% over prior commit; ~53% vs baseline) This optimization only fires when --Kpbwt >= n_ref_haps. Runs that select a PBWT subset (--Kpbwt < n_ref_haps) keep their existing per-iteration recompute path -- the cache in that case would always miss, so the only overhead is one integer compare and one flag check per call.
phasingProb, phasingProbSum, imputeProb and imputeProbSum are each fully overwritten via std::copy from prob/probSumH inside backward() before forward() (or IMPUTE_FLAT_HET) ever read them. The std::fill(...,0.0f) calls on those four buffers are dead work. Profiling showed __memset_avx512_unaligned_erms + __memmove combined at ~5% on x86 post-cache, of which a portion was these fills. Removing them is byte-identical against the previous code (verified across 5 chunks) and trims a small but free amount of overhead.
|
Thank you! Will have to benchmark this a bit (might take a bit of time, the UKB platform is down at the moment), but this sounds amazing! 👍 |
|
Thanks @srubinacci. No rush on my end - I figured it might take a while to review. For my own use, I'm running GLIMPSE via docker using a custom container that has some of my changes merged down into a single branch/tag. That said, it would be nice to work towards a GLIMPSE release (2.0.1? 2.1?) some time over the summer, and update what's available in bioconda etc. I'm happy to help with that however I can. |
|
Yes, I do a agree we should work towards a v2.1 release! |
|
@srubinacci anything I can do to help move this and my other PRs along? I don't have access to UKBB, but I'd be happy to run larger scale concordance on any public dataset and share scripts/pipelines and results. |
|
Hi! The key point is that I did not forget about this, it's just waiting for the UKB to become available. |
|
Thanks Simone! Yeah - I see the current timeline for UKB-RAP is "we'll communicate in June what the timeline is for re-opening", which I read as we'll get an update, but probably not access in June. I wonder would it be sufficient to do some benchmarking using HRC 1.1? I don't have access to that, but I could submit an application today if you felt like that ~64k haplotype panel would be enough to give confidence in the changes. |
Preamble
The changes in this PR yield real and substantial savings in GLIMPSE2_phase runtimes in my hands, on the order of 25-50% reduction in runtime. I've done my best to benchmark what I think are representative uses of GLIMPSE2_phase, but I'm also aware that the range of panels sizes and compositions, and phasing settings are likely significantly broader than I've tested.
These changes have been tested and shown to yield similar reductions in runtime on the following samples:
Summary
conditioning_set::compactSelectionwas the single biggest hotspot inGLIMPSE2_phaseon the workloads I profiled — about 35% of wall-time on x86 and about 25% on arm64 (where simde halves the AVX2 throughput, inflating the relative cost of the surrounding HMM kernels). This PR introduces two small, isolated changes tocompactSelectionplus one trivial dead-code removal. End-to-end wall-time drops by 25-50% depending on coverage, panel size, and--Kpbwt. Output is byte-identical tomasteracross every workload I tested.What this is, and what it is not
This PR does not touch any HMM kernel, any floating-point operation, any emission/transition computation, or any RNG draw. The output is determined entirely by the HMM kernels and those are byte-identical. The only changes are:
Hvarbitmatrix, by writing 8 packed bits at once;compactSelection's output when the conditioning set is the entire reference panel (the iota path that fires by default whenever--Kpbwt >= n_ref_haps);std::fill(...,0.0f)calls inphasing_hmm::reallocatewhose targets are unconditionally overwritten bystd::copybefore they are ever read.The three commits
1.
Pack 8 bits per byte when building Hvar in compactSelectionThe hot inner loop was:
Each
set()is a read-modify-write on a single byte, so we do 8 RMW ops per output byte (~860M byte ops per call on a typical chunk). Replacing this with 8 reads + 1 byte store via a newbitmatrix::setBytehelper produces the same bit pattern with a single cache-friendly write. Trailing bits whenn_statesis not a multiple of 8 still go throughset()so any pre-existing content in unused tail bits of the row's last byte is left untouched.2.
Cache compactSelection result for the full-panel selection pathWhen
--Kpbwt >= n_ref_haps,idxHaps_refends up asiota(0, n_ref_haps)every burn/main iteration. The derived state (var_type,polymorphic_sites,monomorphic_sites,Svar,Hvar,t,nt) depends only on the reference panel — not on the individual or the iteration. The code currently rebuilds all of it on every call (~21 calls per chunk for the default 5+15 burn/main settings).This commit caches the result, keyed on
n_ref_haps. The cache is invalidated whenever any non-iota selection path runs (STAGE_INITwithKinit > 0,STAGE_RESTRICT, the PBWT subset path, or thelist_statespost-processing). When--Kpbwt < n_ref_haps, the cache is never populated and the only overhead is one integer compare and one bool check per call.The cache assumes
His immutable for the lifetime of thisconditioning_set, which holds today — each worker thread gets aconditioning_setallocated once and never reseated to a different panel. I documented this invariant in the header.3.
Remove dead zero-fills in phasing_hmm::reallocatephasingProb,phasingProbSum,imputeProb, andimputeProbSumare each fully overwritten bystd::copyfromprob/probSumHinsidebackward()beforeforward()(orIMPUTE_FLAT_HET()) reads them. The fourstd::fillcalls were dead. Trivial.Performance
All numbers are wall-time on a single chunk, single thread, end-to-end (binary entry to exit). Verified byte-identical via
bcftools view -H | md5sumon every chunk listed.Cat genome reference panel (Nrh=1984, L=434360 sites), single individual
GLIMPSE2 tutorial (NA12878 1x, 1000GP-no-NA12878, 16 chr22 chunks)
This is the public dataset under
tutorial/. All 16 chunks PASSbcftools view -H | md5sumbyte-identity.The 1000GP-vs-cat gap (33.8% vs ~53% on arm64) is because
--Kpbwt 2000<Nrh=5006for the 1000GP panel, so the cache (commit 2) never fires; only the bit-pack (commit 1) is active.Profile shape (chrA1, x86, default --Kpbwt 2000)
conditioning_set::compactSelectionimputation_hmm::backwardphasing_hmm::forwardphasing_hmm::backwardimputation_hmm::forwardcompactSelectionis effectively eliminated. What remains is genuine HMM compute that is already heavily SIMD-optimized.Caveats and notes
--Kpbwt < n_ref_haps: commit 2's cache never fires; commit 1 still does. Speedup drops to ~25-30% on x86. No functional change.conditioning_set::clear()is declared but has no implementation anywhere in the tree (pre-existing). I did not add one. If this method is ever defined and used to reset the object for reuse, it must resetcached_full_panel_n = 0andtransitions_valid = false. This is documented in a comment on the declaration.bcftools view -H | md5sumis the validation I relied on.