fix(codec): avoid MatchError on dovetail FR pairs with asymmetric isFrPair#1155
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1155 +/- ##
=======================================
Coverage 95.94% 95.95%
=======================================
Files 132 132
Lines 8331 8347 +16
Branches 940 984 +44
=======================================
+ Hits 7993 8009 +16
Misses 338 338
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe PR refactors primary-read filtering in CodecConsensusCaller to use pair-level FR-pair classification instead of per-record flags. Records are grouped by read name in input order, templates with exactly two primary FR reads are retained (others rejected via NotPrimaryFrPair), mate-end clipping is applied only to FR templates, and primaries are built from those reads. A companion object provides orderedPrimaryPairs and isPrimaryFrPair helpers. Tests cover degenerate templates, ordering, supplementary/secondary behavior, and FR classification. 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala (2)
392-408: Minor:isPrimaryFrPairdoes not actually check the "primary" flags. The "primary" filtering lives upstream at line 180 (filterNot(r => r.secondary || r.supplementary)); this helper only validates FR geometry. Consider renaming toisFrPair(or having it also assert!a.secondary && !a.supplementary && !b.secondary && !b.supplementary) so the name matches what it guarantees. Not blocking — just prevents future misuse if the helper leaks to other call sites.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala` around lines 392 - 408, isPrimaryFrPair's name is misleading because it doesn't check primary/secondary flags; either rename it to isFrPair or make it verify that both SamRecord a and b are primary (i.e. !secondary && !supplementary) to match the name. Update the helper (isPrimaryFrPair) to either be renamed to isFrPair everywhere it's referenced or add assertions/guards that a.secondary, a.supplementary, b.secondary and b.supplementary are all false before performing the existing geometry checks; also update any callers or documentation to reflect the new name or the added precondition. Ensure references to isPrimaryFrPair in CodecConsensusCaller and tests are updated accordingly to avoid broken references.
183-189: Nice defensive refactor. Grouping by name first and pattern-matching with acase _ => falsefallthrough cleanly handles singleton and ≥3-record buckets, and moving the FR check to the pair level removes the per-record asymmetry. One observation: orphan/singleton buckets are now reported underNotPrimaryFrPair, which conflates "no mate present" with "mate present but wrong orientation". If downstream metrics consumers want to distinguish these cases, a separateOrphanPrimaryreason would be cleaner — but fine to defer.src/test/scala/com/fulcrumgenomics/umi/CodecConsensusCallerTest.scala (1)
329-337: Symmetry test passes even if both sides returnfalse. Consider also pinning the expected outcome so a future regression that classifies this dovetail as non-FR on both orderings still fails the test.♻️ Suggested tightening
- CodecConsensusCaller.isPrimaryFrPair(r1, r2) shouldBe CodecConsensusCaller.isPrimaryFrPair(r2, r1) + CodecConsensusCaller.isPrimaryFrPair(r1, r2) shouldBe true + CodecConsensusCaller.isPrimaryFrPair(r2, r1) shouldBe true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/test/scala/com/fulcrumgenomics/umi/CodecConsensusCallerTest.scala` around lines 329 - 337, The test currently only asserts symmetry by comparing CodecConsensusCaller.isPrimaryFrPair(r1, r2) to the reverse call, which will still pass if both return false; change the test to pin the expected outcome by computing val result = CodecConsensusCaller.isPrimaryFrPair(r1, r2), asserting that CodecConsensusCaller.isPrimaryFrPair(r2, r1) equals result, and then assert result against the correct boolean literal (e.g. result shouldBe true) so future regressions that make this dovetail non-FR on both orderings will fail; update the test named "CodecConsensusCaller.isPrimaryFrPair should return the same answer regardless of argument order" accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala`:
- Around line 392-408: isPrimaryFrPair's name is misleading because it doesn't
check primary/secondary flags; either rename it to isFrPair or make it verify
that both SamRecord a and b are primary (i.e. !secondary && !supplementary) to
match the name. Update the helper (isPrimaryFrPair) to either be renamed to
isFrPair everywhere it's referenced or add assertions/guards that a.secondary,
a.supplementary, b.secondary and b.supplementary are all false before performing
the existing geometry checks; also update any callers or documentation to
reflect the new name or the added precondition. Ensure references to
isPrimaryFrPair in CodecConsensusCaller and tests are updated accordingly to
avoid broken references.
In `@src/test/scala/com/fulcrumgenomics/umi/CodecConsensusCallerTest.scala`:
- Around line 329-337: The test currently only asserts symmetry by comparing
CodecConsensusCaller.isPrimaryFrPair(r1, r2) to the reverse call, which will
still pass if both return false; change the test to pin the expected outcome by
computing val result = CodecConsensusCaller.isPrimaryFrPair(r1, r2), asserting
that CodecConsensusCaller.isPrimaryFrPair(r2, r1) equals result, and then assert
result against the correct boolean literal (e.g. result shouldBe true) so future
regressions that make this dovetail non-FR on both orderings will fail; update
the test named "CodecConsensusCaller.isPrimaryFrPair should return the same
answer regardless of argument order" accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0735a5f0-50c4-4b42-8e39-aefc1fae4358
📒 Files selected for processing (3)
src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scalasrc/main/scala/com/fulcrumgenomics/umi/UmiConsensusCaller.scalasrc/test/scala/com/fulcrumgenomics/umi/CodecConsensusCallerTest.scala
| // `scala.MatchError` on a singleton read-name bucket here. | ||
| // TODO: handle chimeric reads or reads with one unmapped etc. |
There was a problem hiding this comment.
Can you link to the HTSJDK PR here, so we know when we can revert to it's FrPair code?
| if (!a.mapped || !b.mapped) false | ||
| else if (a.refIndex != b.refIndex) false | ||
| else if (a.positiveStrand == b.positiveStrand) false | ||
| else { | ||
| val (fwd, rev) = if (a.positiveStrand) (a, b) else (b, a) | ||
| fwd.unclippedStart <= rev.unclippedEnd | ||
| } | ||
| } |
There was a problem hiding this comment.
Could you just find the higher coordinate of the two reads, and then call htsjdk's method here? Your main problem was lack of consistency, and doing anything on the pair gets you there.
d98c1b6 to
de8e5d4
Compare
…rPair CodecConsensusCaller filtered primaries with a per-record `isFrPair` check then pattern-matched each read-name bucket as `Seq(rec, mate)`. For dovetail FR pairs whose aligned ends coincide (e.g. soft-clipping producing TLEN=+/-1), htsjdk's `SamPairUtil.getPairOrientation` returns FR for R1 but RF for R2, which kept one mate and dropped the other and produced a singleton bucket that blew up with `scala.MatchError`. Group primaries by read name first and keep only buckets that form a single primary FR pair, where the FR check is computed at the pair level from the unclipped 5' positions of both records and is therefore symmetric in its arguments. Templates that do not form a clean primary FR pair (singletons, tandems, RF, cross-contig) are now rejected cleanly with a new `NotPrimaryFrPair` reason instead of crashing. Preserve BAM-iteration order of templates when grouping by read name. The previous `groupBy(_.name).toSeq` produced a template order that depended on the JVM's `HashMap` layout, which propagated through `filterToMostCommonAlignment` and caused `maxBy(_.cigar.lengthOnTarget)` to break ties in hash order when selecting `longestR1Alignment` / `longestR2Alignment`. That could pair records from different templates, producing spurious `IndelErrorBetweenStrands` rejections when the resulting overlap geometry did not agree across the strands. A new `orderedPrimaryPairs` helper groups by read name but orders templates by first-occurrence in the input. Includes unit tests for the new helpers and a regression test that reproduces the exact geometry (68S53M8S / 28S48M53S, aligned ends coinciding) observed in real CODEC data.
Replace the bespoke unclipped-5'-position comparison with a delegating call to `SamPairUtil.getPairOrientation` evaluated on the reverse-strand record. That branch derives both 5' positions from CIGARs (`mate.alignmentStart` vs `record.alignmentEnd`) and is independent of MC/TLEN, so the answer is exact and consistent regardless of which record of the pair is passed first. The per-record asymmetry that motivated this workaround is fixed upstream by samtools/htsjdk#1771 in htsjdk 5.0.0; once we drop pre-5.0.0 support the helper can be replaced with `SamRecord.isFrPair`. Inline link added to both the call-site comment and the helper docstring so future maintainers know when this can be removed. Also add a `mateMapped` guard so calling `getPairOrientation` is safe even on inputs whose mate flags are inconsistent with the records present.
…erting symmetry The previous test asserted only that `isPrimaryFrPair(r1, r2)` equals `isPrimaryFrPair(r2, r1)`, which would still pass if both calls returned `false`. Pin both calls to `true` so a future regression that classifies this dovetail as non-FR on both orderings still fails the test.
de8e5d4 to
4bd57cc
Compare
The previous implementation built a `HashMap` via `groupBy` and then a separate `HashSet` via `iterator.distinct` to enforce first-occurrence order, requiring two passes over the input and a per-record hash lookup in each. Replace both with a single-pass insertion into a `mutable.LinkedHashMap`, which natively preserves insertion order on iteration and uses just one hash structure.
Summary
CallCodecConsensusReadspreviously threwscala.MatchErrordeep in a worker thread after hundreds of millions of records on some CODEC datasets. The crash was deterministic once a template with a specific geometry was encountered.filter(_.isFrPair)immediately followed by a rigidSeq(rec, mate)pattern match atCodecConsensusCaller.scala:176. When a read-name bucket contained only one record, that match blew up.SamPairUtil.getPairOrientation: for dovetail FR pairs whose aligned ends coincide (producing TLEN=±1 after htsjdk's own insert-size math) the call returns FR for R1 but RF for R2 on the same pair. The CODEC caller kept one mate and dropped the other, then crashed on the resulting singleton bucket.groupBy(_.name).toSeqmadelongestR1Alignment/longestR2Alignmentselection non-deterministic across JVMs and could pair records from different templates, producing spuriousIndelErrorBetweenStrandsrejections.Fix
Group primary alignments by read name first, then keep only buckets that form a single primary FR pair, where the FR check is computed at the pair level from the unclipped 5′ positions of both records (so the answer is symmetric in its arguments). Templates that do not form a clean primary FR pair — singletons, tandem, RF, cross-contig, or with an unmapped mate — are now rejected cleanly with a new
NotPrimaryFrPairreason instead of throwing.Secondary fix: deterministic template ordering
consensusSamRecordsFromSamRecordsselects the longest R1 and R2 alignments viamaxBy(_.cigar.lengthOnTarget), which returns the first element in iteration order on ties. The previouspairs.filterNot(...).groupBy(_.name).toSeqpattern produced a template order that depended on the JVM'sString#hashCodeandHashMaplayout rather than BAM-iteration order. On ties this could pick R1 and R2 from different templates whose geometry did not agree, producing spuriousIndelErrorBetweenStrandsrejections.A new
CodecConsensusCaller.orderedPrimaryPairshelper groups by read name while preserving BAM-iteration order (order of first occurrence of each read name viaIterator#distinct). The caller now delegates to the helper. Verified locally thatSeq("zulu", "alpha", "mike", "bravo", "yankee").groupBy(identity).toSeq.map(_._1)yieldsList(alpha, yankee, bravo, zulu, mike)on OpenJDK 17 — a concrete example of the hash-order/BAM-order mismatch that the new test locks down.Minimal repro
The failing stack trace:
The offending template has both R1 and R2 present as primary alignments, both mapped to the same contig, same MI, same UMI, and an identical aligned boundary — the reverse-strand read's
alignmentEndcoincides with the forward-strand read'salignmentStart, and heavy soft-clipping at both ends causes htsjdk to computeTLEN=±1.Schematically:
With
X = R2.alignmentEnd = R1.alignmentStart, htsjdk'sgetPairOrientationreturns FR on R1 (comparingR1.alignmentStarttoR1.alignmentStart + TLEN = X+1) and RF on R2 (comparingR1.alignmentStarttoR2.alignmentEnd, bothX, strict<is false). Per-record disagreement → one mate filtered → singleton bucket →MatchError.The geometry is exercised directly in a new unit test using
SamBuilder— no real data or BAMs committed. A separate smoke-test against a 2-record BAM extracted from a real offending template (not committed) confirmed the tool no longer crashes and instead rejects the template asIndelErrorBetweenStrandsdownstream, which is expected — the CODEC caller isn't wired for dovetails whose aligned regions barely overlap; that is a separate, larger change.Tests
CodecConsensusCaller.isPrimaryFrPair: unit tests for symmetry, canonical FR, RF, cross-chromosomal, one-mate-unmapped, tandem.CodecConsensusCaller.orderedPrimaryPairs: unit tests locking down BAM-iteration order using names thatHashMapwould reorder, plus a test documenting the "caller pre-filters secondary/supplementary" contract.consensusSamRecordsFromSamRecordsregression test that reproduces the dovetail geometry (68S53M8S/28S48M53S, aligned ends coinciding) and asserts the caller does not throw.Full
sbt testpasses locally.Follow-ups (out of scope)
SamPairUtil.getPairOrientationis arguably a bug; a separate htsjdk issue/PR is planned.SamRecord.isFrPairelsewhere in fgbio inherit the same per-record asymmetry but do not crash; they may produce latent correctness issues on dovetail pairs and are worth auditing separately.Test plan
main.groupBy(_.name).toSeqpath.sbt testpasses.