concordance: support aliasing imputed sample IDs to truth-VCF IDs by tfenne · Pull Request #294 · odelaneau/GLIMPSE

tfenne · 2026-05-10T12:30:53Z

Summary

GLIMPSE2_concordance previously matched samples between the imputed VCF and
the validation VCF by exact sample-ID equality. This blocks a common
evaluation workflow: testing the same biological sample (e.g. NA12878)
imputed under many conditions (e.g. multiple downsamplings, different GLIMPSE2_phase settings, etc.) against a single truth row.

This PR extends the existing --samples file with an optional second
whitespace-separated column per row giving the corresponding sample ID in the
validation VCF. The second column is optional on a per-row basis, so a single
file may freely mix aliased and non-aliased rows. Multiple imputed rows may
alias the same validation row.

# samples.txt -- first column = imputed ID, optional second column = validation ID
NA12878.0_5x  NA12878
NA12878.1x    NA12878
NA12878.2x    NA12878
HG00096
HG00097

Implementation notes

The validation-side header is subset to only the K unique validation
IDs needed (K <= N imputed rows). A length-N imputed_to_truth_col[i]
array maps each imputed row to its column in the validation subset.
This preserves the existing per-record decode cost in the common case
(many imputed reps -> one truth sample): the validation side still
decodes K columns per record, not the full validation file.
Per-sample output rows are now labelled with the imputed-side ID so each
replicate appears as its own row.
A defensive per-record check now fails with a clear error if the
validation file's per-sample PL/GT stride disagrees with the
imputed-side ploidy at a site (reachable when the user aliases samples
of differing ploidy).
Documentation (docs/docs/documentation/concordance.md) updated with a
"Sample aliasing" section and an expanded --samples description.

Backwards compatibility

1-column --samples files behave exactly as before.
No---samples runs produce the same per-sample numbers as before. The
per-sample output row order changes from std::unordered_set
iteration order (i.e. hash-bucket order, libstdc++-version-dependent)
to imputed-VCF header order. Strictly an improvement in determinism;
flagging since it affects diffs of *.error.spl.txt.gz and
*.rsquare.spl.txt.gz.

Testing

The repo has no unit-test framework, and the tutorial pipeline doesn't
exercise sample subsetting, so the new path was tested manually:

The tutorial concordance step (tutorial/step7_script_concordance.sh)
was re-run after each commit; all five output files
(output.error.{spl,grp,cal}.txt.gz, output.rsquare.{grp,spl}.txt.gz)
are bit-identical to a pre-change baseline.
The aliasing path was exercised by running with a renamed copy of the
imputed VCF and a 2-column --samples file aliasing the renamed ID
back to the original truth ID; per-sample numbers matched the baseline
with the imputed-side ID labelling the row.
The new ploidy-mismatch check was verified by forcing a haploid
validation VCF (via bcftools +fixploidy -f 1) against the original
diploid imputed file; the check fails with a clear chrom:pos error
instead of producing corrupted output.

GLIMPSE2_concordance previously matched samples between the imputed VCF and the truth VCF by exact sample-ID string equality, which blocks evaluating the same biological sample at multiple downsampling levels (e.g. NA12878.0_5x / NA12878.1x / NA12878.2x against a single NA12878 truth row). Extend the existing --samples file to accept an optional second whitespace-separated column per row giving the corresponding sample ID in the truth VCF. The second column is optional on a per-row basis so a single file may freely mix aliased and non-aliased rows. Rows without an alias keep the prior behavior (truth ID == imputed ID). Multiple imputed rows may map to the same truth row. Internally, the truth header is now subset to only the *unique* truth IDs needed (K), and an imputed_to_truth_col[i] mapping is consulted in the per-record truth-reading loop to pick the right column. This preserves the existing per-record decode efficiency: when many imputed rows alias a single truth sample (the common case), the truth side still decodes only K samples per record rather than the full file. Per-sample output rows are labelled with the imputed-side ID so each downsampled rep gets its own row.

Address reviewer feedback on the sample-aliasing change: - Remove dead `subset_samples_set` field. It was only ever read from a commented-out block; `subset_samples` is authoritative. - Convert non-ASCII glyphs (em-dash, arrow) in source comments and CLI help to ASCII to match the rest of the codebase. - Trim the `--samples` row in the docs table to a one-line summary that forward-references the "Sample aliasing" section, which already carries the full description and example. - Sync the `--samples` `--help` text with the markdown. - Tighten the `imputed_to_truth_col` field comment to spell out the indexing math (`dp_arr_t[i]`, `pl_arr_t[ploidy_t_record*i]`). - Drop a what-comment in the per-record truth-decode loop and rewrite the surviving comment to explain the why of iterating by imputed-row rather than truth-column. - Fix a pre-existing "phread"/"phred" typo in `--gt-val` help and docs.

The per-record truth-decode loop indexes pl_arr_t / dp_arr_t with a mix of per-record (`ploidy_t_record = npl_t / n_true_samples`) and per-run (`ploidy`, `ploidyP1`) strides. If the validation file's per-sample PL/GT stride at a given record disagrees with what the imputed-side ploidy expects, the reads silently walk into the next sample's slot and produce nonsense concordance numbers. This was already a latent issue in the existing code, but pre-aliasing it required a malformed pair of VCFs (same sample ID, different ploidy). Sample aliasing makes it constructible by user error, e.g. mapping a diploid imputed sample to a haploid validation sample. Add a per-record check that the truth-side stride matches what the imputed-side ploidy implies (`ploidy` under --gt-val, otherwise `ploidy + 1`), failing with a clear error that names the chrom:pos and points at --samples aliasing as the likely cause.

tfenne added 3 commits May 9, 2026 05:48

srubinacci self-requested a review May 11, 2026 05:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

concordance: support aliasing imputed sample IDs to truth-VCF IDs#294

concordance: support aliasing imputed sample IDs to truth-VCF IDs#294
tfenne wants to merge 3 commits into
odelaneau:masterfrom
tfenne:tf_eval_sample_mapping

tfenne commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tfenne commented May 10, 2026

Summary

Implementation notes

Backwards compatibility

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant