Optimize compactSelection: 25-50% wall-time reduction in GLIMPSE2_phase, byte-identical results by tfenne · Pull Request #291 · odelaneau/GLIMPSE

tfenne · 2026-05-05T11:03:39Z

Preamble

The changes in this PR yield real and substantial savings in GLIMPSE2_phase runtimes in my hands, on the order of 25-50% reduction in runtime. I've done my best to benchmark what I think are representative uses of GLIMPSE2_phase, but I'm also aware that the range of panels sizes and compositions, and phasing settings are likely significantly broader than I've tested.

These changes have been tested and shown to yield similar reductions in runtime on the following samples:

The example 1000G data in the tutorial
a ~2X coverage sample with a ~1950 hap panel, with -Kpbwt at 2000, and again at 1900
a ~0.2X coverage sample with a ~1950 hap panel, with -Kpbwt at 2000, and again at 1900

Summary

conditioning_set::compactSelection was the single biggest hotspot in GLIMPSE2_phase on the workloads I profiled — about 35% of wall-time on x86 and about 25% on arm64 (where simde halves the AVX2 throughput, inflating the relative cost of the surrounding HMM kernels). This PR introduces two small, isolated changes to compactSelection plus one trivial dead-code removal. End-to-end wall-time drops by 25-50% depending on coverage, panel size, and --Kpbwt. Output is byte-identical to master across every workload I tested.

What this is, and what it is not

This PR does not touch any HMM kernel, any floating-point operation, any emission/transition computation, or any RNG draw. The output is determined entirely by the HMM kernels and those are byte-identical. The only changes are:

avoiding 8 byte read-modify-writes per output byte when constructing the Hvar bitmatrix, by writing 8 packed bits at once;
skipping a fully-redundant rebuild of compactSelection's output when the conditioning set is the entire reference panel (the iota path that fires by default whenever --Kpbwt >= n_ref_haps);
removing four std::fill(...,0.0f) calls in phasing_hmm::reallocate whose targets are unconditionally overwritten by std::copy before they are ever read.

The three commits

1. Pack 8 bits per byte when building Hvar in compactSelection

The hot inner loop was:

for (int k = 0; k < idxHaps_ref.size(); k++)
    Hvar.set(lrel, k, H.HvarRef.get(lcom, idxHaps_ref[k]));

Each set() is a read-modify-write on a single byte, so we do 8 RMW ops per output byte (~860M byte ops per call on a typical chunk). Replacing this with 8 reads + 1 byte store via a new bitmatrix::setByte helper produces the same bit pattern with a single cache-friendly write. Trailing bits when n_states is not a multiple of 8 still go through set() so any pre-existing content in unused tail bits of the row's last byte is left untouched.

2. Cache compactSelection result for the full-panel selection path

When --Kpbwt >= n_ref_haps, idxHaps_ref ends up as iota(0, n_ref_haps) every burn/main iteration. The derived state (var_type, polymorphic_sites, monomorphic_sites, Svar, Hvar, t, nt) depends only on the reference panel — not on the individual or the iteration. The code currently rebuilds all of it on every call (~21 calls per chunk for the default 5+15 burn/main settings).

This commit caches the result, keyed on n_ref_haps. The cache is invalidated whenever any non-iota selection path runs (STAGE_INIT with Kinit > 0, STAGE_RESTRICT, the PBWT subset path, or the list_states post-processing). When --Kpbwt < n_ref_haps, the cache is never populated and the only overhead is one integer compare and one bool check per call.

The cache assumes H is immutable for the lifetime of this conditioning_set, which holds today — each worker thread gets a conditioning_set allocated once and never reseated to a different panel. I documented this invariant in the header.

3. Remove dead zero-fills in phasing_hmm::reallocate

phasingProb, phasingProbSum, imputeProb, and imputeProbSum are each fully overwritten by std::copy from prob/probSumH inside backward() before forward() (or IMPUTE_FLAT_HET()) reads them. The four std::fill calls were dead. Trivial.

Performance

All numbers are wall-time on a single chunk, single thread, end-to-end (binary entry to exit). Verified byte-identical via bcftools view -H | md5sum on every chunk listed.

Cat genome reference panel (Nrh=1984, L=434360 sites), single individual

Workload	Master	Branch	Reduction
1.9x BAM, --Kpbwt 2000 (default), 5 chunks mean	x86: 130s arm64: 140s	x86: 84s arm64: 66s	x86 ~35% arm64 ~53%
0.2x BAM, --Kpbwt 2000, chrA1	x86: 105s	x86: 58s	x86 ~45%
1.9x BAM, --Kpbwt 1900 (PBWT subset), chrA1	x86: 74s arm64: 78s	x86: 54s arm64: 41s	x86 ~27% arm64 ~47%

GLIMPSE2 tutorial (NA12878 1x, 1000GP-no-NA12878, 16 chr22 chunks)

This is the public dataset under tutorial/. All 16 chunks PASS bcftools view -H | md5sum byte-identity.

	Master	Branch	Reduction
arm64, full step5 (16 chunks)	278s	184s	33.8%

The 1000GP-vs-cat gap (33.8% vs ~53% on arm64) is because --Kpbwt 2000 < Nrh=5006 for the 1000GP panel, so the cache (commit 2) never fires; only the bit-pack (commit 1) is active.

Profile shape (chrA1, x86, default --Kpbwt 2000)

Function	Master	After this PR
`conditioning_set::compactSelection`	34.9%	1.0%
`imputation_hmm::backward`	18.0%	27.4%
`phasing_hmm::forward`	17.6%	27.2%
`phasing_hmm::backward`	14.4%	21.8%
`imputation_hmm::forward`	9.6%	14.5%

compactSelection is effectively eliminated. What remains is genuine HMM compute that is already heavily SIMD-optimized.

Caveats and notes

--Kpbwt < n_ref_haps: commit 2's cache never fires; commit 1 still does. Speedup drops to ~25-30% on x86. No functional change.
conditioning_set::clear() is declared but has no implementation anywhere in the tree (pre-existing). I did not add one. If this method is ever defined and used to reset the object for reuse, it must reset cached_full_panel_n = 0 and transitions_valid = false. This is documented in a comment on the declaration.
No new unit tests, in keeping with the project's existing test approach (end-to-end tutorial scripts as ground truth). The byte-identity check via bcftools view -H | md5sum is the validation I relied on.
Builds clean on Linux/x86_64 (gcc 11, native AVX2) and macOS/arm64 (clang, NEON via simde).

Profiling GLIMPSE2_phase on a single 1.9x-coverage diploid sample with ~1984 reference haplotypes (--Kpbwt 2000, so the iota selection path is taken every burn/main iteration) showed conditioning_set::compactSelection consuming ~35% of wall time on x86 -- more than any individual HMM kernel. The dominant cost was the inner loop that builds the Hvar bitmatrix: for (int k = 0 ; k < idxHaps_ref.size() ; k++) Hvar.set(lrel, k, H.HvarRef.get(lcom, idxHaps_ref[k])); Each set() does a read-modify-write on a single byte -- 8 RMW byte ops per output byte, ~860M byte ops per call. Packing 8 selected reference bits into one byte and storing it directly via a new bitmatrix::setByte helper turns those 8 RMW ops into 8 reads and 1 store. The output byte content is identical, so the final VCF/BCF is bit-exact against the previous code. Trailing bits when n_states is not a multiple of 8 still go through the original set() path so any pre-existing content in the unused tail of the last byte of each row is left untouched. Measured wall-time reductions on a single chrA1 chunk, end-to-end: * x86 (Intel Xeon Granite Rapids, native AVX2): 129s -> 96s (~26%) * arm64 (Apple Silicon, AVX2 -> NEON via simde): 140s -> 79s (~44%) The arm64 win is larger because simde turns each 256-bit AVX2 op into two 128-bit NEON ops, which inflates the relative weight of the (scalar) compactSelection bit-pack work versus the SIMD HMM kernels. Validated byte-identical output (bcftools view -H | md5sum) across 5 chunks (chrA1/A2/B2/C1/D1) on both architectures.

When --Kpbwt is >= the number of reference haplotypes, compactSelection takes the iota branch and idxHaps_ref ends up holding every reference haplotype index (0..n_ref_haps-1). In that case the derived state -- var_type, polymorphic_sites, monomorphic_sites, Svar, and the Hvar bitmatrix -- depends only on the reference panel, not on the individual or the iteration. The same is true of the transition vectors t and nt. GLIMPSE2_phase still rebuilds all of that work on every iteration (~21 calls per chunk for typical 5+15 burn/main settings). After the prior commit reduced compactSelection's per-call cost, the function still sat at ~12% of x86 wall and was still doing that fully-redundant work. Cache it: track the n_ref_haps value the full-panel state was last built for, and short-circuit compactSelection when the next call would produce the same selection. updateTransitions short-circuits via a parallel flag that's cleared whenever compactSelection rebuilds. The cache is always invalidated when the use_list path runs (init_states, PBWT-selected subsets, or any future selection mode), so it only short- circuits when the next call's idxHaps_ref would be byte-identical to the last one. Output is byte-exact against the previous code. The cache assumes H is immutable for the lifetime of this conditioning_set, which holds today (one conditioning_set per worker thread, allocated for the duration of a chunk and never reseated to a different panel). Wall-time on the same 5 cat-genome chunks, single 1.9x diploid, --Kpbwt 2000: * x86 (Intel Xeon Granite Rapids): 96s -> 84s (additional ~12% over prior commit; ~35% vs baseline) * arm64 (Apple Silicon, simde): 79s -> 66s (additional ~16% over prior commit; ~53% vs baseline) This optimization only fires when --Kpbwt >= n_ref_haps. Runs that select a PBWT subset (--Kpbwt < n_ref_haps) keep their existing per-iteration recompute path -- the cache in that case would always miss, so the only overhead is one integer compare and one flag check per call.

phasingProb, phasingProbSum, imputeProb and imputeProbSum are each fully overwritten via std::copy from prob/probSumH inside backward() before forward() (or IMPUTE_FLAT_HET) ever read them. The std::fill(...,0.0f) calls on those four buffers are dead work. Profiling showed __memset_avx512_unaligned_erms + __memmove combined at ~5% on x86 post-cache, of which a portion was these fills. Removing them is byte-identical against the previous code (verified across 5 chunks) and trims a small but free amount of overhead.

srubinacci · 2026-05-06T06:12:00Z

Thank you! Will have to benchmark this a bit (might take a bit of time, the UKB platform is down at the moment), but this sounds amazing! 👍

…eduction)

tfenne · 2026-05-09T11:44:29Z

Thanks @srubinacci. No rush on my end - I figured it might take a while to review. For my own use, I'm running GLIMPSE via docker using a custom container that has some of my changes merged down into a single branch/tag. That said, it would be nice to work towards a GLIMPSE release (2.0.1? 2.1?) some time over the summer, and update what's available in bioconda etc. I'm happy to help with that however I can.

srubinacci · 2026-05-11T05:42:19Z

Yes, I do a agree we should work towards a v2.1 release!
There are a larger set of additions on my end (mostly on the logic/algorithmic side, related to research work), but these can be added later.

tfenne · 2026-06-09T14:05:25Z

@srubinacci anything I can do to help move this and my other PRs along? I don't have access to UKBB, but I'd be happy to run larger scale concordance on any public dataset and share scripts/pipelines and results.

srubinacci · 2026-06-10T09:12:22Z

Hi!
Sure, more validation (e.g. on 1000 Genomes data) will of course help.
But we'll eventually need to run this on UKB ref panel. Now, as you probably know, UKB is still down. It's unclear when the will allow researchers to work on it again, but it will likely come back at some point this month. Will update when that will be available.

The key point is that I did not forget about this, it's just waiting for the UKB to become available.
Simone

tfenne · 2026-06-10T12:21:29Z

Thanks Simone! Yeah - I see the current timeline for UKB-RAP is "we'll communicate in June what the timeline is for re-opening", which I read as we'll get an update, but probably not access in June.

I wonder would it be sufficient to do some benchmarking using HRC 1.1? I don't have access to that, but I could submit an application today if you felt like that ~64k haplotype panel would be enough to give confidence in the changes.

tfenne added 3 commits May 4, 2026 18:40

tfenne mentioned this pull request May 5, 2026

Bit-batch phasing_hmm helpers: process 8 states per Hvar byte read for 10-15% perf improvement #292

Open

srubinacci self-requested a review May 6, 2026 06:12

tfenne added a commit to tfenne/GLIMPSE that referenced this pull request May 8, 2026

Merge PR odelaneau#291: Optimize compactSelection (25-50% wall-time r…

781778f

…eduction)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize compactSelection: 25-50% wall-time reduction in GLIMPSE2_phase, byte-identical results#291

Optimize compactSelection: 25-50% wall-time reduction in GLIMPSE2_phase, byte-identical results#291
tfenne wants to merge 3 commits into
odelaneau:masterfrom
tfenne:tf_optimize_phase

tfenne commented May 5, 2026

Uh oh!

srubinacci commented May 6, 2026

Uh oh!

tfenne commented May 9, 2026

Uh oh!

srubinacci commented May 11, 2026

Uh oh!

tfenne commented Jun 9, 2026

Uh oh!

srubinacci commented Jun 10, 2026

Uh oh!

tfenne commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tfenne commented May 5, 2026

Preamble

Summary

What this is, and what it is not

The three commits

Performance

Cat genome reference panel (Nrh=1984, L=434360 sites), single individual

GLIMPSE2 tutorial (NA12878 1x, 1000GP-no-NA12878, 16 chr22 chunks)

Profile shape (chrA1, x86, default --Kpbwt 2000)

Caveats and notes

Uh oh!

srubinacci commented May 6, 2026

Uh oh!

tfenne commented May 9, 2026

Uh oh!

srubinacci commented May 11, 2026

Uh oh!

tfenne commented Jun 9, 2026

Uh oh!

srubinacci commented Jun 10, 2026

Uh oh!

tfenne commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants