Add GLIMPSE2_inspect tool for binary reference panel inspection #288

Open
tfenne wants to merge 1 commit into odelaneau:master from tfenne:tf_inspect_bin

@tfenne tfenne commented Apr 17, 2026

Contributor

@srubinacci one more PR here for a new tool to inspect the binary reference panel files. I find this very useful when trying to debug things (errors, but also variance in runtime and memory usage). I've done my best to match existing style and conventions, and have included updates to the docs and Dockerfile.

It's not a ton of code, so hopefully not too burdensome to review. If you think this is a useful addition, I'd be very open to suggestions if you think there's other output that would be useful, or other breakdowns, etc.

Statistics reported:

  • File size, chromosome, input/output regions (bp and Mbp)
  • Genetic map span for input and output regions (cM)
  • Haplotype count
  • Variant counts: total, common, rare, common HQ, low quality
  • Variant types: SNP, MNP, indel, other (using htslib VCF_* bitmask)
  • Allele frequency distribution: monomorphic, singletons, MAC 2-5, and MAF bins (<1%, 1-5%, 5-50%)
  • Core (output region) vs buffer variant breakdown

Usage: GLIMPSE2_inspect -I panel.bin
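
For anyone reviewing the frequency breakdown, here is a minimal sketch of how a biallelic site could be assigned to one of the bins listed above. It is illustrative only and not the code in this PR: `FreqBin` and `classify_site` are hypothetical names, and the boundary handling (MAC bins taking precedence over MAF bins, a MAF of exactly 1% or 5% falling in the higher bin) is an assumption.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical bin labels mirroring the categories in the statistics list.
enum class FreqBin {
    Monomorphic,   // minor allele count == 0
    Singleton,     // minor allele count == 1
    Mac2to5,       // minor allele count in [2, 5]
    MafBelow1Pct,  // MAC > 5 and MAF < 1%
    Maf1to5Pct,    // 1% <= MAF < 5%
    Maf5to50Pct    // 5% <= MAF <= 50%
};

// Classify one biallelic site from its alternate-allele count and the
// number of haplotypes in the panel. Bins are mutually exclusive, so the
// per-bin counts add up to the total number of variants.
FreqBin classify_site(std::uint32_t alt_count, std::uint32_t n_haplotypes) {
    // Clamp defensively so the subtraction below cannot underflow.
    alt_count = std::min(alt_count, n_haplotypes);
    // The minor allele is whichever of ref/alt is less frequent.
    const std::uint32_t mac = std::min(alt_count, n_haplotypes - alt_count);

    if (mac == 0) return FreqBin::Monomorphic;
    if (mac == 1) return FreqBin::Singleton;
    if (mac <= 5) return FreqBin::Mac2to5;

    const double maf = static_cast<double>(mac) / n_haplotypes;
    if (maf < 0.01) return FreqBin::MafBelow1Pct;
    if (maf < 0.05) return FreqBin::Maf1to5Pct;
    return FreqBin::Maf5to50Pct;
}
```

With 1,984 haplotypes, for example, a site with 6 alternate alleles has a MAF of about 0.3% and lands in the MAF < 1% bin, while a site with 1,983 alternate alleles counts as a singleton because the reference allele is the minor one.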

Example Output:

[GLIMPSE2] Inspect binary reference panel
  * Authors              : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact              : simone.rubinacci@unil.ch & olivier.delaneau@unil.ch
  * Version              : GLIMPSE2_inspect v2.0.0 / commit = a095cfd / release = 2026-04-17
  * Citation             : BiorXiv, (2022). DOI: https://doi.org/10.1101/2022.11.28.518213
  *                      : Nature Genetics 53, 120-126 (2021). DOI: https://doi.org/10.1038/s41588-020-00756-0
  * Run date             : 17/04/2026 - 06:40:23
  * Binary reference panel read (0.23s)

Binary reference panel summary:
  * File                 : autosome_chrD3_34907769_46737256.bin
  * File size            : 162.08 MB
  * Chromosome           : chrD3

  * Input region         : chrD3:34907769-46737256 (11.83 Mbp)
  * Output region        : chrD3:35407768-43979210 (8.57 Mbp)
  * Genetic map (input)  : 0.0000 - 27.8580 cM (27.8580 cM span)
  * Genetic map (output) : 1.4474 - 26.8580 cM (25.4106 cM span)

  * Haplotypes           : 1,984
  * Variants (total)     : 505,119
  *   Common             : 505,119 (100.0%)
  *   Rare               : 0 (0.0%)
  *   Common HQ          : 421,621 (83.5%)
  *   Low quality        : 83,498 (16.5%)

  * Variant types:
  *   SNPs               : 431,206 (85.4%)
  *   Indels             : 73,913 (14.6%)

  * Allele frequency distribution:
  *   Singletons         : 18 (0.0%)
  *   MAC 2-5            : 92,074 (18.2%)
  *   MAF < 1%           : 200,226 (39.6%)
  *   MAF 1-5%           : 85,782 (17.0%)
  *   MAF 5-50%          : 127,019 (25.1%)

  * Region breakdown:
  *   Core (output)      : 378,201 variants
  *   Buffer only        : 126,918 variants
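
The region breakdown at the end of the output can be read as a simple partition of variant positions by whether they fall inside the output region. A minimal sketch of that idea follows; `RegionCounts`, `count_core_and_buffer`, and `span_mbp` are hypothetical names, and treating the output region bounds as inclusive is an assumption rather than something the tool is confirmed to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: variants inside the output region ("core")
// versus variants that only fall in the flanking buffer.
struct RegionCounts {
    std::size_t core = 0;
    std::size_t buffer = 0;
};

// Partition variant positions (in bp) by membership in the output region.
// Bounds are treated as inclusive here, which is an assumption.
RegionCounts count_core_and_buffer(const std::vector<std::uint32_t>& positions_bp,
                                   std::uint32_t output_start,
                                   std::uint32_t output_stop) {
    RegionCounts counts;
    for (const std::uint32_t pos : positions_bp) {
        if (pos >= output_start && pos <= output_stop) {
            ++counts.core;
        } else {
            ++counts.buffer;
        }
    }
    return counts;
}

// Length of a region in Mbp, computed as (stop - start) / 1e6.
double span_mbp(std::uint32_t start_bp, std::uint32_t stop_bp) {
    return static_cast<double>(stop_bp - start_bp) / 1e6;
}
```

Because every variant is either core or buffer, the two counts add up to the total: in the example above, 378,201 + 126,918 = 505,119. Likewise, `span_mbp(34907769, 46737256)` gives roughly 11.83, matching the reported input region size.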

GLIMPSE2's binary reference panels (.bin files) are opaque once created
by split_reference. This new tool deserializes a .bin file and reports
summary statistics that are useful for debugging, validation, and
understanding chunk characteristics.

Follows the existing project structure: shared source files under
src/{containers,io,objects,utils} are symlinks into common/src/ (same
pattern used by every other module), and the makefile is a one-line
include of ../common.mk. No changes to any existing tool.

@srubinacci

Collaborator

Hi, this is an interesting addition, as indeed currently the binary files are a bit of a black box.

I've been thinking we could add an "unsplit" option, where we recreate the original BCF (only if the option is specified), to guarantee retrieval of the original reference haplotypes?

Simone

tfenne commented May 9, 2026

Contributor Author

That's a nice idea @srubinacci - I think it would be easy enough to extend this command to optionally emit a BCF. The command as it stands only works on a single .bin at a time, but I wonder if I could make it emit BCFs that could be fed into GLIMPSE_ligate to put the whole panel back together again...
