
Add --compression-level option for VCF/BCF output #298

Open
tfenne wants to merge 1 commit into odelaneau:master from tfenne:compression-level-option

@tfenne tfenne commented Jun 20, 2026


Motivation

GLIMPSE2_phase, GLIMPSE2_ligate, and GLIMPSE2_concordance hardcode the htslib
output mode to compressed BCF ("wb") / compressed VCF ("wz"), so there is no way
to change the compression level or to emit an uncompressed BCF. Every run pays full
(level-6) compression cost, which is wasteful for large intermediate files that the
next pipeline stage immediately re-reads and decompresses.

@srubinacci this is a purely mechanical change in the output compression/writing layer, so hopefully it can be reviewed (and merged) even without the ability to benchmark on UKB?

Change

Adds a --compression-level option (INT, default 6, range 0–9) to all three tools,
mirroring bcftools -l/--compression-level so the semantics are familiar:

  • The level is appended to the htslib mode string only for compressed formats
    (wb/wz); plain uncompressed VCF (.vcf) is left untouched.
  • Level 0 yields a BGZF-stored (uncompressed) BCF — still BGZF-framed and
    indexable, just not deflated — equivalent to bcftools -l 0.
  • The default of 6 matches htslib's implicit default, so output is byte-identical
    when the flag is unused.
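
As an illustration of the rules above, here is a minimal sketch of how a level could be appended to the htslib mode string. It is not taken from this PR's diff: the function name `output_mode`, the helper `ends_with`, and the extension-based format detection are assumptions made for the example. The only htslib behaviour relied on is that `hts_open` accepts a digit 0-9 after `wb` or `wz`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the PR): true if `s` ends with `suffix`.
static bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical sketch: choose an htslib write mode from the output file name
// and append the compression level only for BGZF-compressed formats.
// Assumes the format is inferred from the extension, which may differ from
// how the GLIMPSE2 tools actually decide it.
std::string output_mode(const std::string& filename, int level) {
    if (level < 0 || level > 9) {
        throw std::invalid_argument("compression level must be in [0, 9]");
    }
    std::string mode;
    if (ends_with(filename, ".bcf")) {
        mode = "wb";   // BGZF-framed BCF
    } else if (ends_with(filename, ".vcf.gz")) {
        mode = "wz";   // BGZF-compressed VCF
    } else {
        return "w";    // plain VCF: the level does not apply
    }
    mode += std::to_string(level);  // e.g. "wb" + "0" -> "wb0"
    return mode;
}
```

Under these assumptions, `output_mode("chunk.bcf", 0)` gives a mode that writes stored (not deflated) BGZF blocks, while `output_mode("chunk.vcf", 0)` ignores the level entirely.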

| Tool | Output affected |
|---|---|
| phase | main VCF/BCF output (BGEN path unaffected; it has its own --bgen-compr) |
| ligate | ligated VCF/BCF output |
| concordance | _rej_sites.bcf / _conc_sites.bcf / _disc_sites.bcf diagnostic outputs |

Help text, startup logging, and the per-tool documentation tables are updated for all
three tools.

Testing

  • All three tools build cleanly.
  • Out-of-range values (e.g. --compression-level 12) are rejected with a clear error.
  • Level 0 produces an uncompressed BCF matching bcftools -Ob0; higher levels compress;
    decoded records are identical across levels.
  • Verified byte-identical default output against current master (matching MD5).
  • ligate's inline CSI index at level 0 verified usable (region query + bcftools index -s).
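
For reviewers who want to see what the out-of-range check amounts to, the following is a rough sketch of validating the option value. It is an assumption-laden illustration, not the PR's code: `parse_compression_level` is a hypothetical name, the error messages are invented, and the actual tools may validate through their option-parsing library instead.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical sketch: parse the --compression-level argument and reject
// anything that is not an integer in the documented range 0-9.
int parse_compression_level(const std::string& text) {
    std::size_t consumed = 0;
    int level = 0;
    try {
        level = std::stoi(text, &consumed);
    } catch (const std::exception&) {
        // std::stoi throws if no number can be parsed or it overflows int.
        throw std::invalid_argument(
            "--compression-level expects an integer, got '" + text + "'");
    }
    if (consumed != text.size()) {
        // Trailing characters such as "6x" are treated as malformed input.
        throw std::invalid_argument(
            "--compression-level expects an integer, got '" + text + "'");
    }
    if (level < 0 || level > 9) {
        throw std::out_of_range(
            "--compression-level must be between 0 and 9, got " + text);
    }
    return level;
}
```

With this sketch, `parse_compression_level("12")` throws `std::out_of_range`, which corresponds to the rejected `--compression-level 12` case listed above.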

GLIMPSE2_phase, _ligate, and _concordance previously hardcoded the
htslib output mode to compressed BCF ("wb") / compressed VCF ("wz") with
no way to change the compression level or to emit an uncompressed BCF.
This forced every run to pay full (level-6) compression cost, which is
wasteful for large intermediate files that are immediately re-read by the
next pipeline stage.

Add a --compression-level option (INT, default 6, range 0-9) mirroring
bcftools' -l/--compression-level so the level is familiar to users.
The level is appended to the htslib mode string only for compressed
formats (wb/wz); plain uncompressed VCF (.vcf) is untouched. Level 0
yields a BGZF-stored (uncompressed) BCF, equivalent to `bcftools -l 0`.
The default of 6 matches htslib's implicit default, so output is
byte-identical when the flag is unused.

Help text, startup logging, and the documentation option tables are
updated for all three tools.
@srubinacci


Thanks! Yes, very easy to add. I should be able to test this very quickly (no need for UKB)
