Add GLIMPSE2_inspect tool for binary reference panel inspection #288
Open
tfenne wants to merge 1 commit into
GLIMPSE2's binary reference panels (.bin files) are opaque once created
by split_reference. This new tool deserializes a .bin file and reports
summary statistics that are useful for debugging, validation, and
understanding chunk characteristics.
Statistics reported:
- File size, chromosome, input/output regions (bp and Mbp)
- Genetic map span for input and output regions (cM)
- Haplotype count
- Variant counts: total, common, rare, common HQ, low quality
- Variant types: SNP, MNP, indel, other (using htslib VCF_* bitmask)
- Allele frequency distribution: monomorphic, singletons, MAC 2-5,
and MAF bins (<1%, 1-5%, 5-50%)
- Core (output region) vs buffer variant breakdown
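The allele frequency bins listed above can be sketched as a small classifier. This is a minimal illustration, not the PR's code: `af_bin` is a hypothetical helper, and both the precedence (count-based bins first, then frequency bins) and the boundary handling (which side of 1% and 5% is inclusive) are assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical helper, not taken from the PR: assign one biallelic site to a
// bin using its minor allele count (MAC) first, then its minor allele
// frequency (MAF). Precondition: alt_count <= n_haps.
std::string af_bin(std::uint32_t alt_count, std::uint32_t n_haps) {
    // The minor allele is whichever of ALT and REF is carried by fewer haplotypes.
    const std::uint32_t mac = std::min(alt_count, n_haps - alt_count);
    if (mac == 0) return "monomorphic";
    if (mac == 1) return "singleton";
    if (mac <= 5) return "MAC 2-5";
    // Only sites with MAC > 5 reach the frequency bins (assumed precedence).
    const double maf = static_cast<double>(mac) / n_haps;
    if (maf < 0.01) return "MAF <1%";   // assumed: upper bound exclusive
    if (maf < 0.05) return "MAF 1-5%";  // assumed: upper bound exclusive
    return "MAF 5-50%";
}
```

Under these assumptions, a site with 6 ALT copies among 1,000 haplotypes lands in the `MAF <1%` bin, since its count is too high for the `MAC 2-5` bin.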
Usage: GLIMPSE2_inspect -I panel.bin
Follows the existing project structure: shared source files under
src/{containers,io,objects,utils} are symlinks into common/src/ (same
pattern used by every other module), and the makefile is a one-line
include of ../common.mk. No changes to any existing tool.
Collaborator
Hi, this is an interesting addition, as the binary files are indeed a bit of a black box at the moment. I've been thinking we could add an "unsplit" option, where we recreate the original BCF (only if specified), to guarantee retrieval of the original reference haplotypes. Simone
Contributor
Author
That's a nice idea @srubinacci - I think it would be easy enough to extend this command to emit a BCF optionally. The command as it stands only works on a single
@srubinacci one more PR here for a new tool to inspect the binary reference panel files. I find this very useful when trying to debug things (errors, but also variance in runtime and memory usage). I've done my best to match existing style and conventions, and have included updates to the docs and Dockerfile.
It's not a ton of code, so hopefully not too burdensome to review. If you think this is a useful addition, I'd be very open to suggestions if you think there's other output that would be useful, or other breakdowns, etc.
Example Output: