Skip to content

rephaseHaplotypes() writes H0 twice instead of H0 then H1 at VAR_FLAT_HET sites #297

Description

@chebizarro

Summary

In phase/src/models/phasing_hmm.cpp, the rephaseHaplotypes() function has a typo in the VAR_FLAT_HET branch that writes to H0 on both lines instead of writing H0 then H1. The second assignment overwrites the first, and H1 is never updated at these sites.

Location

phase/src/models/phasing_hmm.cpp, lines 283–284 (as of v2.0.1 / commit 1a9826a):

const bool rf = (rng.getFloat()*sum) < p01;
H0[VAR_ABS[curr_idx_locus]] = rf;
H0[VAR_ABS[curr_idx_locus]] = !rf;   // <-- should be H1

The net effect is that H0 receives !rf (second write overwrites the first) and H1 retains whatever value sampleHaplotypeH1() assigned earlier in the iteration.

Expected behaviour

The code should write:

H0[VAR_ABS[curr_idx_locus]] = rf;
H1[VAR_ABS[curr_idx_locus]] = !rf;

This matches the pattern used everywhere else in the same function:

  • VAR_PEAK_HET branch (lines 277–278): writes H0 then H1
  • Monomorphic het shuffling (lines 306–307): writes H0 then H1

Affected versions

The bug was introduced in the initial v2.0.0 commit (1a9826a, 2022-11-30) and is present in all subsequent versions including v2.0.1, master, and the ap-field branch.

Practical impact

Low. We investigated this thoroughly and believe it has negligible effect on output accuracy, for the following reasons:

  1. Only VAR_FLAT_HET sites are affected — these are heterozygous sites where the genotype likelihoods are uninformative (no sequencing reads, or low quality). By definition, there is little to no data at these sites.

  2. Output posteriors (GP/DS) are not directly affectedstoreGenotypePosteriorsAndHaplotypes() uses the imputation HMM posteriors (HP0, HP1), not the H0/H1 arrays directly. The imputation HMM skips flat sites entirely, so their posteriors are correctly uninformative regardless of this bug.

  3. The phasing HMM effect is minimal — in the next Gibbs iteration, reallocate() reads H0/H1 to classify sites. With the bug, some flat hets may appear homozygous by chance (H0 == H1) and be dropped from the phasing HMM. But flat-het emissions are already uniform (uninformative), so dropping them removes essentially no phasing signal.

  4. PBWT propagation is negligibleupdateHaplotypes() does copy all H0/H1 values (including flat sites) into the target haplotype panel for PBWT matching. However, the "correct" values at flat sites are themselves only weakly informed, the effect is symmetric/unbiased, and in typical single-sample or small-batch runs the target's noisy flat-site alleles are negligible against a large reference panel.

  5. We verified experimentally — running the fixed binary against the original on real low-coverage WGS data (cattle, ~5952 reference haplotypes, chr10, 41 chunks) produced byte-identical BCF record streams across all chunks.

Suggested fix

One-character change:

--- a/phase/src/models/phasing_hmm.cpp
+++ b/phase/src/models/phasing_hmm.cpp
@@ -281,7 +281,7 @@
 			sum = p01+p10;
 			const bool rf = (rng.getFloat()*sum) < p01;
 			H0[VAR_ABS[curr_idx_locus]] = rf;
-			H0[VAR_ABS[curr_idx_locus]] = !rf;
+			H1[VAR_ABS[curr_idx_locus]] = !rf;
 			curr_missing_locus++;
 		}

Happy to open a PR if preferred.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions