Skip to content

feat: Template#26

Open
msto wants to merge 20 commits into
mainfrom
ms_20_template
Open

feat: Template#26
msto wants to merge 20 commits into
mainfrom
ms_20_template

Conversation

@msto

@msto msto commented Dec 6, 2025

Copy link
Copy Markdown

Part of #20

Adds the Template type for grouping BAM alignments by query name — the first BAM-related module in fgoxide. Modeled directly on the existing Template implementations in fgbio (Scala) and fgpyo (Python); the module rustdoc cites both upstream sources at pinned commits.

API

  • Template::new(records) — validating constructor
  • Template::new_unchecked(records) — fast path for callers with pre-grouped input (e.g. records from a query-name-grouped iterator); mirrors fgpyo's build(validate=False)
  • Accessors: name, all_recs, primary_recs, all_r1s, all_r2s, len, is_empty
  • TemplateError carries Vec<u8> qnames so non-UTF-8 names round-trip losslessly
  • all_recs order follows fgbio's Scala (r1, r2, r1_supp, r2_supp, r1_sec, r2_sec); fgpyo's interleaved order is acknowledged inline

Dependencies / MSRV

Adds rust-htslib 0.51, which uses <u{N}>::is_multiple_of (stable in Rust 1.87). MSRV bumped to 1.87 accordingly. thiserror 2.x came in via the recent main upgrade.

Review iteration

Incorporates @nh13's suggestions: new_unchecked, EmptyTemplate variant, Vec<u8> qnames, Clone, all_r1s/all_r2s/len, fgbio iteration order, and runnable doctests. Naming open: supplementaries (matches the SAM spec) is currently in tree; awaiting confirmation that this is preferred over supplementals. @theJasonFan's TryFrom<IntoIterator> suggestion is held for now since new() already accepts IntoIterator directly.

Test plan

97 unit tests + 6 doctests covering paired and single-end inputs, all four secondary/supplementary buckets, fgbio's "secondary AND supplementary → secondaries" precedence, every error variant including non-UTF-8 qnames, Default/Clone, the LAST_IN_PAIR-without-PAIRED edge case, R1+R2 supplementaries with both primaries, and new_unchecked's documented "last primary wins" overwrite semantics.

@msto msto force-pushed the ms_20_template branch 2 times, most recently from 025bbe6 to 60bdaea Compare December 6, 2025 21:13
Comment thread src/bam/mod.rs Outdated
Comment thread src/bam/mod.rs Outdated
Comment thread src/bam/mod.rs
Comment thread src/bam/mod.rs Outdated
@msto msto marked this pull request as ready for review December 6, 2025 21:18
@msto msto requested review from nh13 and tfenne as code owners December 6, 2025 21:18
@msto msto requested a review from theJasonFan December 6, 2025 21:18
@msto msto assigned nh13 and tfenne Dec 6, 2025
@msto msto mentioned this pull request Dec 7, 2025
Comment thread src/bam/mod.rs Outdated

@nh13 nh13 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments and suggestions based on looking at what we've done in fgpyo/fgbio (not they are perfect).

Comment thread src/bam/mod.rs
/// Secondary alignments for R2.
pub r2_secondaries: Vec<Record>,
/// Supplementary alignments for R1.
pub r1_supplementaries: Vec<Record>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: can we use supplementals like we do in fgbio/fgpyo?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SAM specification uses "supplementary"; I think the use of "supplemental" is specific to these two packages. Would you be open to using supplementaries, since that's consistent with spec? (I always typo the fgpyo methods because I think of supplementary alignments.)

https://samtools.github.io/hts-specs/SAMv1.pdf

If you agree, I can open issues (and PRs) on fgpyo and fgbio to rename these methods to use supplementaries (and keep the supplementals helpers, just with deprecation warnings).

Comment thread src/bam/mod.rs
Comment thread src/bam/mod.rs
Comment thread src/bam/mod.rs
Comment thread src/bam/mod.rs Outdated
Comment thread src/bam/mod.rs Outdated
///
/// # Examples
///
/// ```ignore

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why not make these runnable? Consider adding a test helper module or using no_run if the issue is just runtime dependencies.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make them runnable!

Comment thread src/bam/mod.rs Outdated
@msto msto assigned msto and unassigned nh13 and tfenne Apr 15, 2026
msto and others added 14 commits May 7, 2026 16:29
Add a method to retrieve the query name from a Template.
Returns the query name from the first available record,
checking in order: r1, r2, r1_secondaries, r2_secondaries,
r1_supplementaries, r2_supplementaries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add an iterator method that yields all records in the template.
Records are yielded in order: r1, r2, r1_secondaries, r2_secondaries,
r1_supplementaries, r2_supplementaries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add an iterator method that yields only the primary records (r1 and r2).
This is a convenience method for iterating over just the primary alignments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add the bam module to the public API, making Template and TemplateError
accessible to users of the crate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test module with tests for:
- Building template from paired-end primaries
- Template::name() returning query name
- Template::all_recs() iterator
- Template::primary_recs() iterator
- R1-only and R2-only templates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying that unpaired reads (without PAIRED flag)
are correctly placed in r1 following the fgbio convention.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying:
- Secondary R1/R2 alignments are placed correctly
- Supplementary R1/R2 alignments are placed correctly
- Records with both secondary AND supplementary flags go to secondaries
  (per fgbio convention: secondary flag is checked first)
- Multiple secondaries and supplementaries are handled correctly
- all_recs() includes all alignment types

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying that Template::build() returns appropriate errors:
- MismatchedQueryNames when records have different query names
- MultiplePrimaryR1 when multiple primary R1 records are found
- MultiplePrimaryR2 when multiple primary R2 records are found
- Multiple unpaired reads correctly trigger MultiplePrimaryR1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests for edge cases and special scenarios:
- Empty template (no records)
- Template::name() from secondary-only records
- Template::name() from supplementary-only records
- Default template via Template::default()
- Verification of all_recs() ordering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Accept any IntoIterator instead of requiring Vec<Record>. This allows
callers to pass iterators directly without collecting to Vec first.
Leverage the existing all_recs() iterator which already has the correct
priority order, instead of chaining multiple or_else calls.
msto and others added 3 commits May 7, 2026 16:29
Add standard traits to TemplateError for easier testing and error
handling. Clone is useful for retrying operations, and PartialEq/Eq
enable direct equality assertions in tests.
Replace match blocks with direct assert_eq! calls now that TemplateError
derives PartialEq.
Follow Rust conventions for the primary constructor method.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@msto msto force-pushed the ms_20_template branch from 5f0f992 to d2b624c Compare May 7, 2026 20:47
Incorporates Nils's review thread (accepted items) plus findings from a
/simplify + code-review pass.

From Nils:
- Derive Clone on Template
- Add EmptyTemplate error; new(vec![]) now errors
- Store qnames as Vec<u8> in TemplateError (preserves non-UTF-8 bytes)
- Add all_r1s(), all_r2s(), len(), is_empty()
- Reorder all_recs() to match fgbio Scala: r1, r2, r1_supp, r2_supp,
  r1_sec, r2_sec
- Add new_unchecked() for hot paths where invariants are guaranteed
- Doctests are runnable (no more `ignore`)

From the review pass:
- Drop per-record Vec<u8> allocation in new() hot path; hoist the
  String allocation out of the success path
- Share a private push() helper between new() and new_unchecked() to
  remove duplicated classification dispatch
- Flatten the 3-level if-cascade into a single match on (Bucket, is_r1)
- Short-circuit name() with an or_else chain
- Document that direct field mutation can void invariants
- Add tests for: R1+R2 supplementaries with both primaries,
  LAST_IN_PAIR without PAIRED, non-UTF-8 qnames in error round-trip,
  len/is_empty/all_r1s/all_r2s/Clone

Open: supplementaries vs supplementals naming pending Nils's reply;
TryFrom<IntoIterator> deferred (new() already takes IntoIterator).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@msto msto force-pushed the ms_20_template branch from d2b624c to e9b4489 Compare May 7, 2026 20:51
msto and others added 2 commits May 7, 2026 16:53
rust-htslib 0.51 uses `<u{N}>::is_multiple_of`, stabilized in Rust 1.87.
Without this bump the MSRV CI job fails on `cargo check`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cite the fgbio (Scala) and fgpyo (Python) Template implementations the
Rust port was modeled on, with verified line ranges at pinned commits so
future readers can compare against the exact source we ported from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@msto msto assigned nh13 and unassigned msto May 7, 2026
@msto msto requested a review from nh13 May 7, 2026 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants