Skip to content

Optimize compactSelection: 25-50% wall-time reduction in GLIMPSE2_phase, byte-identical results#291

Open
tfenne wants to merge 3 commits into
odelaneau:masterfrom
tfenne:tf_optimize_phase
Open

Optimize compactSelection: 25-50% wall-time reduction in GLIMPSE2_phase, byte-identical results#291
tfenne wants to merge 3 commits into
odelaneau:masterfrom
tfenne:tf_optimize_phase

Conversation

@tfenne

@tfenne tfenne commented May 5, 2026

Copy link
Copy Markdown
Contributor

Preamble

The changes in this PR yield real and substantial savings in GLIMPSE2_phase runtimes in my hands, on the order of 25-50% reduction in runtime. I've done my best to benchmark what I think are representative uses of GLIMPSE2_phase, but I'm also aware that the range of panels sizes and compositions, and phasing settings are likely significantly broader than I've tested.

These changes have been tested and shown to yield similar reductions in runtime on the following samples:

  • The example 1000G data in the tutorial
  • a ~2X coverage sample with a ~1950 hap panel, with -Kpbwt at 2000, and again at 1900
  • a ~0.2X coverage sample with a ~1950 hap panel, with -Kpbwt at 2000, and again at 1900

Summary

conditioning_set::compactSelection was the single biggest hotspot in GLIMPSE2_phase on the workloads I profiled — about 35% of wall-time on x86 and about 25% on arm64 (where simde halves the AVX2 throughput, inflating the relative cost of the surrounding HMM kernels). This PR introduces two small, isolated changes to compactSelection plus one trivial dead-code removal. End-to-end wall-time drops by 25-50% depending on coverage, panel size, and --Kpbwt. Output is byte-identical to master across every workload I tested.

What this is, and what it is not

This PR does not touch any HMM kernel, any floating-point operation, any emission/transition computation, or any RNG draw. The output is determined entirely by the HMM kernels and those are byte-identical. The only changes are:

  • avoiding 8 byte read-modify-writes per output byte when constructing the Hvar bitmatrix, by writing 8 packed bits at once;
  • skipping a fully-redundant rebuild of compactSelection's output when the conditioning set is the entire reference panel (the iota path that fires by default whenever --Kpbwt >= n_ref_haps);
  • removing four std::fill(...,0.0f) calls in phasing_hmm::reallocate whose targets are unconditionally overwritten by std::copy before they are ever read.

The three commits

1. Pack 8 bits per byte when building Hvar in compactSelection

The hot inner loop was:

for (int k = 0; k < idxHaps_ref.size(); k++)
    Hvar.set(lrel, k, H.HvarRef.get(lcom, idxHaps_ref[k]));

Each set() is a read-modify-write on a single byte, so we do 8 RMW ops per output byte (~860M byte ops per call on a typical chunk). Replacing this with 8 reads + 1 byte store via a new bitmatrix::setByte helper produces the same bit pattern with a single cache-friendly write. Trailing bits when n_states is not a multiple of 8 still go through set() so any pre-existing content in unused tail bits of the row's last byte is left untouched.

2. Cache compactSelection result for the full-panel selection path

When --Kpbwt >= n_ref_haps, idxHaps_ref ends up as iota(0, n_ref_haps) every burn/main iteration. The derived state (var_type, polymorphic_sites, monomorphic_sites, Svar, Hvar, t, nt) depends only on the reference panel — not on the individual or the iteration. The code currently rebuilds all of it on every call (~21 calls per chunk for the default 5+15 burn/main settings).

This commit caches the result, keyed on n_ref_haps. The cache is invalidated whenever any non-iota selection path runs (STAGE_INIT with Kinit > 0, STAGE_RESTRICT, the PBWT subset path, or the list_states post-processing). When --Kpbwt < n_ref_haps, the cache is never populated and the only overhead is one integer compare and one bool check per call.

The cache assumes H is immutable for the lifetime of this conditioning_set, which holds today — each worker thread gets a conditioning_set allocated once and never reseated to a different panel. I documented this invariant in the header.

3. Remove dead zero-fills in phasing_hmm::reallocate

phasingProb, phasingProbSum, imputeProb, and imputeProbSum are each fully overwritten by std::copy from prob/probSumH inside backward() before forward() (or IMPUTE_FLAT_HET()) reads them. The four std::fill calls were dead. Trivial.

Performance

All numbers are wall-time on a single chunk, single thread, end-to-end (binary entry to exit). Verified byte-identical via bcftools view -H | md5sum on every chunk listed.

Cat genome reference panel (Nrh=1984, L=434360 sites), single individual

Workload Master Branch Reduction
1.9x BAM, --Kpbwt 2000 (default), 5 chunks mean x86: 130s    arm64: 140s x86: 84s    arm64: 66s x86 ~35%    arm64 ~53%
0.2x BAM, --Kpbwt 2000, chrA1 x86: 105s x86: 58s x86 ~45%
1.9x BAM, --Kpbwt 1900 (PBWT subset), chrA1 x86: 74s    arm64: 78s x86: 54s    arm64: 41s x86 ~27%    arm64 ~47%

GLIMPSE2 tutorial (NA12878 1x, 1000GP-no-NA12878, 16 chr22 chunks)

This is the public dataset under tutorial/. All 16 chunks PASS bcftools view -H | md5sum byte-identity.

Master Branch Reduction
arm64, full step5 (16 chunks) 278s 184s 33.8%

The 1000GP-vs-cat gap (33.8% vs ~53% on arm64) is because --Kpbwt 2000 < Nrh=5006 for the 1000GP panel, so the cache (commit 2) never fires; only the bit-pack (commit 1) is active.

Profile shape (chrA1, x86, default --Kpbwt 2000)

Function Master After this PR
conditioning_set::compactSelection 34.9% 1.0%
imputation_hmm::backward 18.0% 27.4%
phasing_hmm::forward 17.6% 27.2%
phasing_hmm::backward 14.4% 21.8%
imputation_hmm::forward 9.6% 14.5%

compactSelection is effectively eliminated. What remains is genuine HMM compute that is already heavily SIMD-optimized.

Caveats and notes

  • --Kpbwt < n_ref_haps: commit 2's cache never fires; commit 1 still does. Speedup drops to ~25-30% on x86. No functional change.
  • conditioning_set::clear() is declared but has no implementation anywhere in the tree (pre-existing). I did not add one. If this method is ever defined and used to reset the object for reuse, it must reset cached_full_panel_n = 0 and transitions_valid = false. This is documented in a comment on the declaration.
  • No new unit tests, in keeping with the project's existing test approach (end-to-end tutorial scripts as ground truth). The byte-identity check via bcftools view -H | md5sum is the validation I relied on.
  • Builds clean on Linux/x86_64 (gcc 11, native AVX2) and macOS/arm64 (clang, NEON via simde).

tfenne added 3 commits May 4, 2026 18:40
Profiling GLIMPSE2_phase on a single 1.9x-coverage diploid sample with
~1984 reference haplotypes (--Kpbwt 2000, so the iota selection path is
taken every burn/main iteration) showed conditioning_set::compactSelection
consuming ~35% of wall time on x86 -- more than any individual HMM kernel.

The dominant cost was the inner loop that builds the Hvar bitmatrix:

    for (int k = 0 ; k < idxHaps_ref.size() ; k++)
        Hvar.set(lrel, k, H.HvarRef.get(lcom, idxHaps_ref[k]));

Each set() does a read-modify-write on a single byte -- 8 RMW byte ops per
output byte, ~860M byte ops per call.

Packing 8 selected reference bits into one byte and storing it directly via
a new bitmatrix::setByte helper turns those 8 RMW ops into 8 reads and 1
store. The output byte content is identical, so the final VCF/BCF is
bit-exact against the previous code.

Trailing bits when n_states is not a multiple of 8 still go through the
original set() path so any pre-existing content in the unused tail of the
last byte of each row is left untouched.

Measured wall-time reductions on a single chrA1 chunk, end-to-end:

  * x86 (Intel Xeon Granite Rapids, native AVX2): 129s -> 96s  (~26%)
  * arm64 (Apple Silicon, AVX2 -> NEON via simde): 140s -> 79s (~44%)

The arm64 win is larger because simde turns each 256-bit AVX2 op into two
128-bit NEON ops, which inflates the relative weight of the (scalar)
compactSelection bit-pack work versus the SIMD HMM kernels.

Validated byte-identical output (bcftools view -H | md5sum) across 5
chunks (chrA1/A2/B2/C1/D1) on both architectures.
When --Kpbwt is >= the number of reference haplotypes, compactSelection
takes the iota branch and idxHaps_ref ends up holding every reference
haplotype index (0..n_ref_haps-1). In that case the derived state --
var_type, polymorphic_sites, monomorphic_sites, Svar, and the Hvar
bitmatrix -- depends only on the reference panel, not on the individual
or the iteration. The same is true of the transition vectors t and nt.

GLIMPSE2_phase still rebuilds all of that work on every iteration (~21
calls per chunk for typical 5+15 burn/main settings). After the prior
commit reduced compactSelection's per-call cost, the function still sat
at ~12% of x86 wall and was still doing that fully-redundant work.

Cache it: track the n_ref_haps value the full-panel state was last built
for, and short-circuit compactSelection when the next call would produce
the same selection. updateTransitions short-circuits via a parallel flag
that's cleared whenever compactSelection rebuilds.

The cache is always invalidated when the use_list path runs (init_states,
PBWT-selected subsets, or any future selection mode), so it only short-
circuits when the next call's idxHaps_ref would be byte-identical to the
last one. Output is byte-exact against the previous code.

The cache assumes H is immutable for the lifetime of this conditioning_set,
which holds today (one conditioning_set per worker thread, allocated for
the duration of a chunk and never reseated to a different panel).

Wall-time on the same 5 cat-genome chunks, single 1.9x diploid, --Kpbwt 2000:

  * x86 (Intel Xeon Granite Rapids): 96s -> 84s   (additional ~12% over prior commit; ~35% vs baseline)
  * arm64 (Apple Silicon, simde):    79s -> 66s   (additional ~16% over prior commit; ~53% vs baseline)

This optimization only fires when --Kpbwt >= n_ref_haps. Runs that select
a PBWT subset (--Kpbwt < n_ref_haps) keep their existing per-iteration
recompute path -- the cache in that case would always miss, so the only
overhead is one integer compare and one flag check per call.
phasingProb, phasingProbSum, imputeProb and imputeProbSum are each fully
overwritten via std::copy from prob/probSumH inside backward() before
forward() (or IMPUTE_FLAT_HET) ever read them. The std::fill(...,0.0f)
calls on those four buffers are dead work.

Profiling showed __memset_avx512_unaligned_erms + __memmove combined at
~5% on x86 post-cache, of which a portion was these fills. Removing them
is byte-identical against the previous code (verified across 5 chunks)
and trims a small but free amount of overhead.
@srubinacci

Copy link
Copy Markdown
Collaborator

Thank you! Will have to benchmark this a bit (might take a bit of time, the UKB platform is down at the moment), but this sounds amazing! 👍

@srubinacci srubinacci self-requested a review May 6, 2026 06:12
tfenne added a commit to tfenne/GLIMPSE that referenced this pull request May 8, 2026
@tfenne

tfenne commented May 9, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @srubinacci. No rush on my end - I figured it might take a while to review. For my own use, I'm running GLIMPSE via docker using a custom container that has some of my changes merged down into a single branch/tag. That said, it would be nice to work towards a GLIMPSE release (2.0.1? 2.1?) some time over the summer, and update what's available in bioconda etc. I'm happy to help with that however I can.

@srubinacci

Copy link
Copy Markdown
Collaborator

Yes, I do a agree we should work towards a v2.1 release!
There are a larger set of additions on my end (mostly on the logic/algorithmic side, related to research work), but these can be added later.

@tfenne

tfenne commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@srubinacci anything I can do to help move this and my other PRs along? I don't have access to UKBB, but I'd be happy to run larger scale concordance on any public dataset and share scripts/pipelines and results.

@srubinacci

Copy link
Copy Markdown
Collaborator

Hi!
Sure, more validation (e.g. on 1000 Genomes data) will of course help.
But we'll eventually need to run this on UKB ref panel. Now, as you probably know, UKB is still down. It's unclear when the will allow researchers to work on it again, but it will likely come back at some point this month. Will update when that will be available.

The key point is that I did not forget about this, it's just waiting for the UKB to become available.
Simone

@tfenne

tfenne commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Thanks Simone! Yeah - I see the current timeline for UKB-RAP is "we'll communicate in June what the timeline is for re-opening", which I read as we'll get an update, but probably not access in June.

I wonder would it be sufficient to do some benchmarking using HRC 1.1? I don't have access to that, but I could submit an application today if you felt like that ~64k haplotype panel would be enough to give confidence in the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants