concordance: support aliasing imputed sample IDs to truth-VCF IDs#294
Open
tfenne wants to merge 3 commits into
GLIMPSE2_concordance previously matched samples between the imputed VCF and the truth VCF by exact sample-ID string equality, which blocks evaluating the same biological sample at multiple downsampling levels (e.g. NA12878.0_5x / NA12878.1x / NA12878.2x against a single NA12878 truth row).

Extend the existing --samples file to accept an optional second whitespace-separated column per row giving the corresponding sample ID in the truth VCF. The second column is optional on a per-row basis so a single file may freely mix aliased and non-aliased rows. Rows without an alias keep the prior behavior (truth ID == imputed ID). Multiple imputed rows may map to the same truth row.

Internally, the truth header is now subset to only the *unique* truth IDs needed (K), and an imputed_to_truth_col[i] mapping is consulted in the per-record truth-reading loop to pick the right column. This preserves the existing per-record decode efficiency: when many imputed rows alias a single truth sample (the common case), the truth side still decodes only K samples per record rather than the full file. Per-sample output rows are labelled with the imputed-side ID so each downsampled rep gets its own row.
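The mapping described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `TruthSubset` and `build_truth_subset` are hypothetical names, and the sketch assumes truth columns are numbered in first-seen order, whereas a real implementation would have to follow whatever column order the subset truth header actually yields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container (not from the PR): the K unique truth IDs plus the
// length-N map from imputed row to truth column.
struct TruthSubset {
    std::vector<std::string> unique_truth_ids;      // K entries, first-seen order (assumption)
    std::vector<std::size_t> imputed_to_truth_col;  // N entries, one per imputed row
};

// Input: for each imputed row, the truth ID it should be compared against
// (the alias if one was given, otherwise the imputed ID itself).
TruthSubset build_truth_subset(const std::vector<std::string>& truth_id_per_imputed_row) {
    TruthSubset out;
    std::unordered_map<std::string, std::size_t> col_of;  // truth ID -> column in the subset
    out.imputed_to_truth_col.reserve(truth_id_per_imputed_row.size());
    for (const std::string& truth_id : truth_id_per_imputed_row) {
        // emplace only inserts when the ID is new; the candidate column is the current K.
        auto inserted = col_of.emplace(truth_id, out.unique_truth_ids.size());
        if (inserted.second) out.unique_truth_ids.push_back(truth_id);
        out.imputed_to_truth_col.push_back(inserted.first->second);
    }
    assert(out.unique_truth_ids.size() <= truth_id_per_imputed_row.size());  // K <= N
    return out;
}
```

With three downsampled replicates aliased to one truth sample, this gives K = 1 and a map of all zeros, so only one truth column needs decoding per record.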
Address reviewer feedback on the sample-aliasing change:

- Remove dead `subset_samples_set` field. It was only ever read from a commented-out block; `subset_samples` is authoritative.
- Convert non-ASCII glyphs (em-dash, arrow) in source comments and CLI help to ASCII to match the rest of the codebase.
- Trim the `--samples` row in the docs table to a one-line summary that forward-references the "Sample aliasing" section, which already carries the full description and example.
- Sync the `--samples` `--help` text with the markdown.
- Tighten the `imputed_to_truth_col` field comment to spell out the indexing math (`dp_arr_t[i]`, `pl_arr_t[ploidy_t_record*i]`).
- Drop a what-comment in the per-record truth-decode loop and rewrite the surviving comment to explain the why of iterating by imputed-row rather than truth-column.
- Fix a pre-existing "phread"/"phred" typo in `--gt-val` help and docs.
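To make the indexing math concrete, here is a small sketch of reading one imputed row's truth-side PL values out of a flat per-record array. The helper name `truth_pl_for_imputed_row` and the use of `std::vector` are assumptions for illustration; the PR's loop works on raw decoded arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy the PL slice belonging to imputed row i.
// pl_arr_t holds K * stride values, truth column after truth column.
std::vector<int> truth_pl_for_imputed_row(const std::vector<int>& pl_arr_t,
                                          std::size_t stride,
                                          const std::vector<std::size_t>& imputed_to_truth_col,
                                          std::size_t i) {
    const std::size_t col = imputed_to_truth_col.at(i);  // truth column for this imputed row
    const std::size_t begin = stride * col;              // same shape as pl_arr_t[ploidy_t_record * col]
    assert(begin + stride <= pl_arr_t.size());
    const auto first = pl_arr_t.begin() + static_cast<std::ptrdiff_t>(begin);
    return std::vector<int>(first, first + static_cast<std::ptrdiff_t>(stride));
}
```

Iterating by imputed row and looking up the column, rather than iterating by truth column, is what lets several imputed rows read the same truth slice.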
The per-record truth-decode loop indexes pl_arr_t / dp_arr_t with a mix of per-record (`ploidy_t_record = npl_t / n_true_samples`) and per-run (`ploidy`, `ploidyP1`) strides. If the validation file's per-sample PL/GT stride at a given record disagrees with what the imputed-side ploidy expects, the reads silently walk into the next sample's slot and produce nonsense concordance numbers.

This was already a latent issue in the existing code, but pre-aliasing it required a malformed pair of VCFs (same sample ID, different ploidy). Sample aliasing makes it constructible by user error, e.g. mapping a diploid imputed sample to a haploid validation sample.

Add a per-record check that the truth-side stride matches what the imputed-side ploidy implies (`ploidy` under --gt-val, otherwise `ploidy + 1`), failing with a clear error that names the chrom:pos and points at --samples aliasing as the likely cause.
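A standalone sketch of such a check is shown below. The function name `check_truth_stride`, its parameter list, and the error wording are assumptions for illustration; only the stride rule (`ploidy` under --gt-val, otherwise `ploidy + 1`) comes from the description above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical per-record check: n_values_t is the number of truth-side
// PL (or GT) values decoded for this record across all n_true_samples columns.
void check_truth_stride(int n_values_t, int n_true_samples, int ploidy,
                        bool gt_validation, const std::string& chrom, long pos) {
    assert(n_true_samples > 0);
    // GT carries one value per allele copy; PL at a biallelic site carries ploidy + 1.
    const int expected = gt_validation ? ploidy : ploidy + 1;
    const bool divides = (n_values_t % n_true_samples) == 0;
    const int ploidy_t_record = n_values_t / n_true_samples;
    if (!divides || ploidy_t_record != expected) {
        throw std::runtime_error(
            "Validation per-sample stride " + std::to_string(ploidy_t_record) +
            " does not match expected " + std::to_string(expected) +
            " at " + chrom + ":" + std::to_string(pos) +
            "; check --samples aliasing for samples of differing ploidy");
    }
}
```

Failing at the first mismatching record trades a hard stop for the guarantee that no output row is computed from another sample's slot.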
Summary
`GLIMPSE2_concordance` previously matched samples between the imputed VCF and the validation VCF by exact sample-ID equality. This blocks a common evaluation workflow: testing the same biological sample (e.g. `NA12878`) imputed under many conditions (e.g. multiple downsamplings, different `GLIMPSE2_phase` settings) against a single truth row.

This PR extends the existing `--samples` file with an optional second whitespace-separated column per row giving the corresponding sample ID in the validation VCF. The second column is optional on a per-row basis, so a single file may freely mix aliased and non-aliased rows. Multiple imputed rows may alias the same validation row.
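As an illustration of the file format, the sketch below parses the text of a `--samples` file into (imputed ID, validation ID) pairs. `parse_samples_text` is a hypothetical helper, not the parser in this PR, and skipping blank lines is an assumption.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parser: one row per line, first column is the imputed sample ID,
// optional second whitespace-separated column is the validation sample ID.
std::vector<std::pair<std::string, std::string>>
parse_samples_text(const std::string& text) {
    std::vector<std::pair<std::string, std::string>> rows;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string imputed_id, truth_id;
        if (!(fields >> imputed_id)) continue;             // blank line: skipped (assumption)
        if (!(fields >> truth_id)) truth_id = imputed_id;  // no alias: validation ID == imputed ID
        rows.emplace_back(imputed_id, truth_id);
    }
    return rows;
}
```

For example, a file with the rows `NA12878.0_5x NA12878`, `NA12878.1x NA12878`, and a bare `HG002` (an illustrative ID) mixes two aliased rows with one non-aliased row.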
Implementation notes
- The validation header is subset to only the `K` unique validation IDs needed (`K <= N` imputed rows). A length-`N` `imputed_to_truth_col[i]` array maps each imputed row to its column in the validation subset. This preserves the existing per-record decode cost in the common case (many imputed reps -> one truth sample): the validation side still decodes `K` columns per record, not the full validation file.
- Per-sample output rows are labelled with the imputed-side ID, so each replicate appears as its own row.
- A per-record check fails with a clear error if the validation file's per-sample PL/GT stride disagrees with the imputed-side ploidy at a site (reachable when the user aliases samples of differing ploidy).
- Docs (`docs/docs/documentation/concordance.md`) updated with a "Sample aliasing" section and an expanded `--samples` description.

Backwards compatibility
- Existing `--samples` files behave exactly as before.
- `--samples` runs produce the same per-sample numbers as before. The per-sample output row order changes from `std::unordered_set` iteration order (i.e. hash-bucket order, libstdc++-version-dependent) to imputed-VCF header order. Strictly an improvement in determinism; flagging since it affects diffs of `*.error.spl.txt.gz` and `*.rsquare.spl.txt.gz`.

Testing
The repo has no unit-test framework, and the tutorial pipeline doesn't
exercise sample subsetting, so the new path was tested manually:
- The tutorial script (`tutorial/step7_script_concordance.sh`) was re-run after each commit; all five output files (`output.error.{spl,grp,cal}.txt.gz`, `output.rsquare.{grp,spl}.txt.gz`) are bit-identical to a pre-change baseline.
- A renamed-sample imputed VCF and a 2-column `--samples` file aliasing the renamed ID back to the original truth ID; per-sample numbers matched the baseline with the imputed-side ID labelling the row.
- A haploid validation VCF (via `bcftools +fixploidy -f 1`) against the original diploid imputed file; the check fails with a clear chrom:pos error instead of producing corrupted output.