Summary
In phase/src/models/phasing_hmm.cpp, the rephaseHaplotypes() function has a typo in the VAR_FLAT_HET branch that writes to H0 on both lines instead of writing H0 then H1. The second assignment overwrites the first, and H1 is never updated at these sites.
Location
phase/src/models/phasing_hmm.cpp, lines 283–284 (as of v2.0.1 / commit 1a9826a):
const bool rf = (rng.getFloat()*sum) < p01;
H0[VAR_ABS[curr_idx_locus]] = rf;
H0[VAR_ABS[curr_idx_locus]] = !rf; // <-- should be H1
The net effect is that H0 receives !rf (second write overwrites the first) and H1 retains whatever value sampleHaplotypeH1() assigned earlier in the iteration.
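The overwrite described above can be reproduced in isolation. The following is a minimal sketch, not the GLIMPSE code: the `HapPair` struct and the two `writeFlatHet*` helpers are hypothetical names introduced only for illustration, and the haplotype arrays are modelled as plain `std::vector<bool>`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the two haplotype arrays. In the real code
// H0 and H1 are indexed via VAR_ABS[curr_idx_locus]; here `site` plays that role.
struct HapPair {
    std::vector<bool> H0;
    std::vector<bool> H1;
};

// Mirrors the current VAR_FLAT_HET branch: both writes target H0.
void writeFlatHetBuggy(HapPair& h, std::size_t site, bool rf) {
    assert(site < h.H0.size() && site < h.H1.size());
    h.H0[site] = rf;
    h.H0[site] = !rf;  // overwrites the previous line; H1 is left untouched
}

// Mirrors the proposed fix: H0 and H1 receive complementary alleles.
void writeFlatHetFixed(HapPair& h, std::size_t site, bool rf) {
    assert(site < h.H0.size() && site < h.H1.size());
    h.H0[site] = rf;
    h.H1[site] = !rf;
}
```

Calling `writeFlatHetBuggy` with `rf = true` on a site where H1 currently holds `false` leaves both H0 and H1 at `false`, so the site looks homozygous. `writeFlatHetFixed` always leaves the two haplotypes complementary.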
Expected behaviour
The code should write:
H0[VAR_ABS[curr_idx_locus]] = rf;
H1[VAR_ABS[curr_idx_locus]] = !rf;
This matches the pattern used everywhere else in the same function:
- VAR_PEAK_HET branch (lines 277–278): writes H0 then H1 ✓
- Monomorphic het shuffling (lines 306–307): writes H0 then H1 ✓
Affected versions
The bug was introduced in the initial v2.0.0 commit (1a9826a, 2022-11-30) and is present in all subsequent versions including v2.0.1, master, and the ap-field branch.
Practical impact
Low. We investigated this thoroughly and believe it has negligible effect on output accuracy, for the following reasons:
- Only VAR_FLAT_HET sites are affected: these are heterozygous sites where the genotype likelihoods are uninformative (no sequencing reads, or low quality). By definition, there is little to no data at these sites.
- Output posteriors (GP/DS) are not directly affected: storeGenotypePosteriorsAndHaplotypes() uses the imputation HMM posteriors (HP0, HP1), not the H0/H1 arrays directly. The imputation HMM skips flat sites entirely, so their posteriors are correctly uninformative regardless of this bug.
- The phasing HMM effect is minimal: in the next Gibbs iteration, reallocate() reads H0/H1 to classify sites. With the bug, some flat hets may appear homozygous by chance (H0 == H1) and be dropped from the phasing HMM. But flat-het emissions are already uniform (uninformative), so dropping them removes essentially no phasing signal.
- PBWT propagation is negligible: updateHaplotypes() does copy all H0/H1 values (including flat sites) into the target haplotype panel for PBWT matching. However, the "correct" values at flat sites are themselves only weakly informed, the effect is symmetric/unbiased, and in typical single-sample or small-batch runs the target's noisy flat-site alleles are negligible against a large reference panel.
- We verified this experimentally: running the fixed binary against the original on real low-coverage WGS data (cattle, ~5952 reference haplotypes, chr10, 41 chunks) produced byte-identical BCF record streams across all chunks.
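To make the "homozygous by chance" point concrete, here is a rough simulation sketch. It rests on assumptions that are not taken from the GLIMPSE source: the stale H1 allele is modelled as an independent fair coin flip, `rf` is also drawn as a fair coin rather than from the actual p01/p10 values, and `homozygousFraction` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Fraction of simulated flat-het sites where H0 == H1 after the write.
// ASSUMPTION: the value already in H1 is an independent fair coin flip.
// In GLIMPSE it is whatever sampleHaplotypeH1() assigned, which is not modelled here.
double homozygousFraction(bool buggy, std::size_t n_sites, unsigned seed) {
    assert(n_sites > 0);
    std::mt19937 gen(seed);
    std::bernoulli_distribution coin(0.5);
    std::size_t n_hom = 0;
    for (std::size_t i = 0; i < n_sites; ++i) {
        bool H1 = coin(gen);        // stale allele left from earlier in the iteration
        const bool rf = coin(gen);  // stand-in for (rng.getFloat()*sum) < p01
        bool H0 = rf;
        if (buggy) {
            H0 = !rf;               // second write lands on H0 again
        } else {
            H1 = !rf;               // fixed: complementary allele goes to H1
        }
        if (H0 == H1) ++n_hom;
    }
    return static_cast<double>(n_hom) / static_cast<double>(n_sites);
}
```

Under these assumptions the buggy write leaves roughly half of the flat hets looking homozygous, while the fixed write leaves none. How often this happens in a real run depends on the actual distribution of the stale H1 values.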
Suggested fix
One-character change:
--- a/phase/src/models/phasing_hmm.cpp
+++ b/phase/src/models/phasing_hmm.cpp
@@ -281,7 +281,7 @@
sum = p01+p10;
const bool rf = (rng.getFloat()*sum) < p01;
H0[VAR_ABS[curr_idx_locus]] = rf;
- H0[VAR_ABS[curr_idx_locus]] = !rf;
+ H1[VAR_ABS[curr_idx_locus]] = !rf;
curr_missing_locus++;
}
Happy to open a PR if preferred.