Skip to content

concordance: support aliasing imputed sample IDs to truth-VCF IDs#294

Open
tfenne wants to merge 3 commits into
odelaneau:masterfrom
tfenne:tf_eval_sample_mapping
Open

concordance: support aliasing imputed sample IDs to truth-VCF IDs#294
tfenne wants to merge 3 commits into
odelaneau:masterfrom
tfenne:tf_eval_sample_mapping

Conversation

@tfenne

@tfenne tfenne commented May 10, 2026

Copy link
Copy Markdown
Contributor

Summary

GLIMPSE2_concordance previously matched samples between the imputed VCF and
the validation VCF by exact sample-ID equality. This blocks a common
evaluation workflow: testing the same biological sample (e.g. NA12878)
imputed under many conditions (e.g. multiple downsamplings, different GLIMPSE2_phase settings, etc.) against a single truth row.

This PR extends the existing --samples file with an optional second
whitespace-separated column per row giving the corresponding sample ID in the
validation VCF. The second column is optional on a per-row basis, so a single
file may freely mix aliased and non-aliased rows. Multiple imputed rows may
alias the same validation row.

# samples.txt -- first column = imputed ID, optional second column = validation ID
NA12878.0_5x  NA12878
NA12878.1x    NA12878
NA12878.2x    NA12878
HG00096
HG00097

Implementation notes

  • The validation-side header is subset to only the K unique validation
    IDs needed (K <= N imputed rows). A length-N imputed_to_truth_col[i]
    array maps each imputed row to its column in the validation subset.
    This preserves the existing per-record decode cost in the common case
    (many imputed reps -> one truth sample): the validation side still
    decodes K columns per record, not the full validation file.
  • Per-sample output rows are now labelled with the imputed-side ID so each
    replicate appears as its own row.
  • A defensive per-record check now fails with a clear error if the
    validation file's per-sample PL/GT stride disagrees with the
    imputed-side ploidy at a site (reachable when the user aliases samples
    of differing ploidy).
  • Documentation (docs/docs/documentation/concordance.md) updated with a
    "Sample aliasing" section and an expanded --samples description.

Backwards compatibility

  • 1-column --samples files behave exactly as before.
  • No---samples runs produce the same per-sample numbers as before. The
    per-sample output row order changes from std::unordered_set
    iteration order (i.e. hash-bucket order, libstdc++-version-dependent)
    to imputed-VCF header order. Strictly an improvement in determinism;
    flagging since it affects diffs of *.error.spl.txt.gz and
    *.rsquare.spl.txt.gz.

Testing

The repo has no unit-test framework, and the tutorial pipeline doesn't
exercise sample subsetting, so the new path was tested manually:

  • The tutorial concordance step (tutorial/step7_script_concordance.sh)
    was re-run after each commit; all five output files
    (output.error.{spl,grp,cal}.txt.gz, output.rsquare.{grp,spl}.txt.gz)
    are bit-identical to a pre-change baseline.
  • The aliasing path was exercised by running with a renamed copy of the
    imputed VCF and a 2-column --samples file aliasing the renamed ID
    back to the original truth ID; per-sample numbers matched the baseline
    with the imputed-side ID labelling the row.
  • The new ploidy-mismatch check was verified by forcing a haploid
    validation VCF (via bcftools +fixploidy -f 1) against the original
    diploid imputed file; the check fails with a clear chrom:pos error
    instead of producing corrupted output.

tfenne added 3 commits May 9, 2026 05:48
GLIMPSE2_concordance previously matched samples between the imputed VCF
and the truth VCF by exact sample-ID string equality, which blocks
evaluating the same biological sample at multiple downsampling levels
(e.g. NA12878.0_5x / NA12878.1x / NA12878.2x against a single NA12878
truth row).

Extend the existing --samples file to accept an optional second
whitespace-separated column per row giving the corresponding sample ID
in the truth VCF. The second column is optional on a per-row basis so a
single file may freely mix aliased and non-aliased rows. Rows without
an alias keep the prior behavior (truth ID == imputed ID). Multiple
imputed rows may map to the same truth row.

Internally, the truth header is now subset to only the *unique* truth
IDs needed (K), and an imputed_to_truth_col[i] mapping is consulted in
the per-record truth-reading loop to pick the right column. This
preserves the existing per-record decode efficiency: when many imputed
rows alias a single truth sample (the common case), the truth side
still decodes only K samples per record rather than the full file.

Per-sample output rows are labelled with the imputed-side ID so each
downsampled rep gets its own row.
Address reviewer feedback on the sample-aliasing change:
- Remove dead `subset_samples_set` field. It was only ever read from a
  commented-out block; `subset_samples` is authoritative.
- Convert non-ASCII glyphs (em-dash, arrow) in source comments and CLI
  help to ASCII to match the rest of the codebase.
- Trim the `--samples` row in the docs table to a one-line summary that
  forward-references the "Sample aliasing" section, which already
  carries the full description and example.
- Sync the `--samples` `--help` text with the markdown.
- Tighten the `imputed_to_truth_col` field comment to spell out the
  indexing math (`dp_arr_t[i]`, `pl_arr_t[ploidy_t_record*i]`).
- Drop a what-comment in the per-record truth-decode loop and rewrite
  the surviving comment to explain the why of iterating by imputed-row
  rather than truth-column.
- Fix a pre-existing "phread"/"phred" typo in `--gt-val` help and docs.
The per-record truth-decode loop indexes pl_arr_t / dp_arr_t with a
mix of per-record (`ploidy_t_record = npl_t / n_true_samples`) and
per-run (`ploidy`, `ploidyP1`) strides. If the validation file's
per-sample PL/GT stride at a given record disagrees with what the
imputed-side ploidy expects, the reads silently walk into the next
sample's slot and produce nonsense concordance numbers.

This was already a latent issue in the existing code, but pre-aliasing
it required a malformed pair of VCFs (same sample ID, different
ploidy). Sample aliasing makes it constructible by user error, e.g.
mapping a diploid imputed sample to a haploid validation sample.

Add a per-record check that the truth-side stride matches what the
imputed-side ploidy implies (`ploidy` under --gt-val, otherwise
`ploidy + 1`), failing with a clear error that names the chrom:pos
and points at --samples aliasing as the likely cause.
@srubinacci srubinacci self-requested a review May 11, 2026 05:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant