fix(codec): avoid MatchError on dovetail FR pairs with asymmetric isFrPair #1155

Merged
tfenne merged 4 commits into main from nh_fix-codec-matcherror-on-asymmetric-frpair
May 19, 2026

Conversation

nh13 (Member) commented Apr 17, 2026

Summary

  • CallCodecConsensusReads previously threw scala.MatchError deep in a worker thread after hundreds of millions of records on some CODEC datasets. The crash was deterministic once a template with a specific geometry was encountered.
  • The failure site is a per-record filter(_.isFrPair) immediately followed by a rigid Seq(rec, mate) pattern match at CodecConsensusCaller.scala:176. When a read-name bucket contained only one record, that match blew up.
  • Root cause is an asymmetry in htsjdk's SamPairUtil.getPairOrientation: for dovetail FR pairs whose aligned ends coincide (producing TLEN=±1 after htsjdk's own insert-size math) the call returns FR for R1 but RF for R2 on the same pair. The CODEC caller kept one mate and dropped the other, then crashed on the resulting singleton bucket.
  • A secondary fix in this PR also preserves BAM-iteration order of templates when grouping primaries by read name. The previous groupBy(_.name).toSeq made longestR1Alignment / longestR2Alignment selection non-deterministic across JVMs and could pair records from different templates, producing spurious IndelErrorBetweenStrands rejections.
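The failure mode in the second bullet can be reproduced without htsjdk. The following is a minimal sketch, not the fgbio code: `Rec`, `rigid`, and `tolerant` are hypothetical names, and the `isFrPair` field only stands in for the per-record flag.

```scala
object SingletonBucketDemo {
  // Hypothetical stand-in for a SAM record; `isFrPair` mimics a per-record answer.
  final case class Rec(name: String, firstOfPair: Boolean, isFrPair: Boolean)

  // Same shape as the failing code: filter per record, group by name, then
  // match every bucket as exactly two records. A one-record bucket has no
  // matching case, so the pattern-matching function throws scala.MatchError.
  def rigid(recs: Seq[Rec]): Seq[(Rec, Rec)] =
    recs.filter(_.isFrPair).groupBy(_.name).toSeq.map { case (_, Seq(rec, mate)) => (rec, mate) }

  // Total variant: `collect` skips buckets that are not exactly two records.
  def tolerant(recs: Seq[Rec]): Seq[(Rec, Rec)] =
    recs.filter(_.isFrPair).groupBy(_.name).toSeq.collect { case (_, Seq(rec, mate)) => (rec, mate) }
}
```

When the two mates of one template disagree on `isFrPair`, `rigid` throws while `tolerant` returns an empty result for that template.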

Fix

Group primary alignments by read name first, then keep only buckets that form a single primary FR pair, where the FR check is computed at the pair level from the unclipped 5′ positions of both records (so the answer is symmetric in its arguments). Templates that do not form a clean primary FR pair — singletons, tandem, RF, cross-contig, or with an unmapped mate — are now rejected cleanly with a new NotPrimaryFrPair reason instead of throwing.
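The group-then-classify approach described above might look roughly like the sketch below. It assumes a hypothetical `Rec` type whose field names echo the snippet reviewed later in this thread; `partition` and its return shape are illustrative and are not the fgbio API.

```scala
object PairLevelFilterDemo {
  // Hypothetical minimal record holding only the fields the check reads.
  final case class Rec(name: String, mapped: Boolean, refIndex: Int, positiveStrand: Boolean,
                       unclippedStart: Int, unclippedEnd: Int)

  // Pair-level check. It decides which record is the forward one before
  // comparing coordinates, so swapping the arguments cannot change the answer.
  def isFrPair(a: Rec, b: Rec): Boolean =
    if (!a.mapped || !b.mapped) false
    else if (a.refIndex != b.refIndex) false
    else if (a.positiveStrand == b.positiveStrand) false
    else {
      val (fwd, rev) = if (a.positiveStrand) (a, b) else (b, a)
      fwd.unclippedStart <= rev.unclippedEnd
    }

  // Splits read-name buckets into kept pairs and a count of rejected buckets.
  // Any bucket that is not exactly two records falls through to `false`.
  def partition(buckets: Seq[Seq[Rec]]): (Seq[(Rec, Rec)], Int) = {
    val (keep, reject) = buckets.partition {
      case Seq(a, b) => isFrPair(a, b)
      case _         => false
    }
    (keep.collect { case Seq(a, b) => (a, b) }, reject.size)
  }
}
```

The rejected count is where a reason such as NotPrimaryFrPair would be recorded instead of throwing.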

Secondary fix: deterministic template ordering

consensusSamRecordsFromSamRecords selects the longest R1 and R2 alignments via maxBy(_.cigar.lengthOnTarget), which returns the first element in iteration order on ties. The previous pairs.filterNot(...).groupBy(_.name).toSeq pattern produced a template order that depended on the JVM's String#hashCode and HashMap layout rather than BAM-iteration order. On ties this could pick R1 and R2 from different templates whose geometry did not agree, producing spurious IndelErrorBetweenStrands rejections.

A new CodecConsensusCaller.orderedPrimaryPairs helper groups by read name while preserving BAM-iteration order (order of first occurrence of each read name via Iterator#distinct). The caller now delegates to the helper. Verified locally that Seq("zulu", "alpha", "mike", "bravo", "yankee").groupBy(identity).toSeq.map(_._1) yields List(alpha, yankee, bravo, zulu, mike) on OpenJDK 17 — a concrete example of the hash-order/BAM-order mismatch that the new test locks down.
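A minimal sketch of first-occurrence grouping via `Iterator#distinct` (Scala 2.13 or later) is shown below. `groupInOrder` is a hypothetical generic helper used only for illustration; the real `orderedPrimaryPairs` works on SAM records and may differ in detail.

```scala
object OrderedGroupingDemo {
  // Groups values by key and returns the groups in order of each key's first
  // occurrence in the input, rather than in hash-map iteration order.
  def groupInOrder[A, K](xs: Seq[A])(key: A => K): Seq[(K, Seq[A])] = {
    val byKey = xs.groupBy(key)                       // order of keys here is unspecified
    xs.iterator.map(key).distinct                     // keys in first-occurrence order
      .map(k => k -> byKey(k))
      .toList
  }
}
```

Elements inside each group keep their input order, because `groupBy` on a `Seq` appends to each group as it traverses the input.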

Minimal repro

The failing stack trace:

Exception in thread "main" scala.MatchError: (<readname>, Vector(<readname> 1/2 129b aligned to <chrom>:X-Y.)) (of class scala.Tuple2)
  at com.fulcrumgenomics.umi.CodecConsensusCaller.$anonfun$consensusSamRecordsFromSamRecords$8(CodecConsensusCaller.scala:176)

The offending template has both R1 and R2 present as primary alignments, both mapped to the same contig, same MI, same UMI, and an identical aligned boundary — the reverse-strand read's alignmentEnd coincides with the forward-strand read's alignmentStart, and heavy soft-clipping at both ends causes htsjdk to compute TLEN=±1.

Schematically:

R1  FLAG=97   POS=X     CIGAR=68S53M8S   TLEN=+1   (forward)
R2  FLAG=145  POS=X-47  CIGAR=28S48M53S  TLEN=-1   (reverse)

With X = R2.alignmentEnd = R1.alignmentStart, htsjdk's getPairOrientation returns FR on R1 (comparing R1.alignmentStart to R1.alignmentStart + TLEN = X+1) and RF on R2 (comparing R1.alignmentStart to R2.alignmentEnd, both X, strict < is false). Per-record disagreement → one mate filtered → singleton bucket → MatchError.
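The arithmetic above can be checked with a small model. This is a simplified re-statement of the comparison as described in this PR, not htsjdk itself: `View` and `orientation` are hypothetical names, the tandem and unmapped checks are omitted, and `x = 1000` is an arbitrary stand-in for X.

```scala
object OrientationAsymmetryDemo {
  sealed trait Orientation
  case object FR extends Orientation
  case object RF extends Orientation

  // Hypothetical per-record view holding only the fields the comparison reads.
  final case class View(negativeStrand: Boolean, alignmentStart: Int, alignmentEnd: Int,
                        mateAlignmentStart: Int, tlen: Int)

  // Per-record comparison: a forward read infers the mate's 5' end from TLEN,
  // a reverse read uses its own alignment end. The comparison is a strict <.
  def orientation(r: View): Orientation = {
    val positive5p = if (r.negativeStrand) r.mateAlignmentStart else r.alignmentStart
    val negative5p = if (r.negativeStrand) r.alignmentEnd else r.alignmentStart + r.tlen
    if (positive5p < negative5p) FR else RF
  }

  val x = 1000 // arbitrary value for X
  // R1: forward, POS=X, 53 aligned bases, TLEN=+1
  val r1 = View(negativeStrand = false, alignmentStart = x, alignmentEnd = x + 52,
                mateAlignmentStart = x - 47, tlen = 1)
  // R2: reverse, POS=X-47, 48 aligned bases so alignmentEnd = X, TLEN=-1
  val r2 = View(negativeStrand = true, alignmentStart = x - 47, alignmentEnd = x,
                mateAlignmentStart = x, tlen = -1)
}
```

Under this model R1 compares X with X+1 and R2 compares X with X, so the two mates of the same pair receive different orientations.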

The geometry is exercised directly in a new unit test using SamBuilder — no real data or BAMs committed. A separate smoke-test against a 2-record BAM extracted from a real offending template (not committed) confirmed the tool no longer crashes and instead rejects the template as IndelErrorBetweenStrands downstream, which is expected — the CODEC caller isn't wired for dovetails whose aligned regions barely overlap; that is a separate, larger change.

Tests

  • CodecConsensusCaller.isPrimaryFrPair: unit tests for symmetry, canonical FR, RF, cross-chromosomal, one-mate-unmapped, tandem.
  • CodecConsensusCaller.orderedPrimaryPairs: unit tests locking down BAM-iteration order using names that HashMap would reorder, plus a test documenting the "caller pre-filters secondary/supplementary" contract.
  • consensusSamRecordsFromSamRecords regression test that reproduces the dovetail geometry (68S53M8S / 28S48M53S, aligned ends coinciding) and asserts the caller does not throw.
  • Singleton-bucket regression test (defensive path) that asserts the caller rejects cleanly instead of crashing.

Full sbt test passes locally.

Follow-ups (out of scope)

  • The per-record asymmetry in htsjdk's SamPairUtil.getPairOrientation is arguably a bug; a separate htsjdk issue/PR is planned.
  • Other uses of SamRecord.isFrPair elsewhere in fgbio inherit the same per-record asymmetry but do not crash; they may produce latent correctness issues on dovetail pairs and are worth auditing separately.

Test plan

  • Reproduce MatchError with a SamBuilder test built on current main.
  • Verify fix prevents the MatchError in that test.
  • Lock down BAM-iteration order of primary-pair templates via a dedicated unit test that would fail against the broken groupBy(_.name).toSeq path.
  • Verify fix on an offline smoke-test BAM extracted from a real failing MI family (not committed).
  • Full sbt test passes.

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.95%. Comparing base (21470e2) to head (03f9769).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1155   +/-   ##
=======================================
  Coverage   95.94%   95.95%           
=======================================
  Files         132      132           
  Lines        8331     8347   +16     
  Branches      940      984   +44     
=======================================
+ Hits         7993     8009   +16     
  Misses        338      338           
Flag        Coverage Δ
unittests   95.95% <100.00%> (+<0.01%) ⬆️

github-actions Bot commented Apr 17, 2026

PR Preview Action v1.6.1

🚀 View preview at
https://fulcrumgenomics.github.io/fgbio/pr-preview/pr-1155/

Built to branch gh-pages at 2026-05-19 22:28 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@nh13 nh13 marked this pull request as ready for review April 17, 2026 20:15
@nh13 nh13 requested review from clintval and tfenne as code owners April 17, 2026 20:15
coderabbitai Bot commented Apr 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85c0d939-7a92-44db-acc2-b89d7740da05

📥 Commits

Reviewing files that changed from the base of the PR and between 4bd57cc and 03f9769.

📒 Files selected for processing (1)
  • src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala

📝 Walkthrough

The PR refactors primary-read filtering in CodecConsensusCaller to use pair-level FR-pair classification instead of per-record flags. Records are grouped by read name in input order, templates with exactly two primary FR reads are retained (others rejected via NotPrimaryFrPair), mate-end clipping is applied only to FR templates, and primaries are built from those reads. A companion object provides orderedPrimaryPairs and isPrimaryFrPair helpers. Tests cover degenerate templates, ordering, supplementary/secondary behavior, and FR classification.

🚥 Pre-merge checks | ✅ 5 passed

  • Title check: ✅ Passed. Title directly addresses the root cause (asymmetric isFrPair on dovetail FR pairs) and the primary fix (avoiding MatchError), accurately summarizing the main changeset.
  • Description check: ✅ Passed. Description comprehensively explains the problem, root cause, both fixes, tests, and follow-ups, all directly related to the changeset.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala (2)

392-408: Minor: isPrimaryFrPair does not actually check the "primary" flags. The "primary" filtering lives upstream at line 180 (filterNot(r => r.secondary || r.supplementary)); this helper only validates FR geometry. Consider renaming to isFrPair (or having it also assert !a.secondary && !a.supplementary && !b.secondary && !b.supplementary) so the name matches what it guarantees. Not blocking — just prevents future misuse if the helper leaks to other call sites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala` around
lines 392 - 408, isPrimaryFrPair's name is misleading because it doesn't check
primary/secondary flags; either rename it to isFrPair or make it verify that
both SamRecord a and b are primary (i.e. !secondary && !supplementary) to match
the name. Update the helper (isPrimaryFrPair) to either be renamed to isFrPair
everywhere it's referenced or add assertions/guards that a.secondary,
a.supplementary, b.secondary and b.supplementary are all false before performing
the existing geometry checks; also update any callers or documentation to
reflect the new name or the added precondition. Ensure references to
isPrimaryFrPair in CodecConsensusCaller and tests are updated accordingly to
avoid broken references.

183-189: Nice defensive refactor. Grouping by name first and pattern-matching with a case _ => false fallthrough cleanly handles singleton and ≥3-record buckets, and moving the FR check to the pair level removes the per-record asymmetry. One observation: orphan/singleton buckets are now reported under NotPrimaryFrPair, which conflates "no mate present" with "mate present but wrong orientation". If downstream metrics consumers want to distinguish these cases, a separate OrphanPrimary reason would be cleaner — but fine to defer.

src/test/scala/com/fulcrumgenomics/umi/CodecConsensusCallerTest.scala (1)

329-337: Symmetry test passes even if both sides return false. Consider also pinning the expected outcome so a future regression that classifies this dovetail as non-FR on both orderings still fails the test.

♻️ Suggested tightening
-    CodecConsensusCaller.isPrimaryFrPair(r1, r2) shouldBe CodecConsensusCaller.isPrimaryFrPair(r2, r1)
+    CodecConsensusCaller.isPrimaryFrPair(r1, r2) shouldBe true
+    CodecConsensusCaller.isPrimaryFrPair(r2, r1) shouldBe true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/test/scala/com/fulcrumgenomics/umi/CodecConsensusCallerTest.scala` around
lines 329 - 337, The test currently only asserts symmetry by comparing
CodecConsensusCaller.isPrimaryFrPair(r1, r2) to the reverse call, which will
still pass if both return false; change the test to pin the expected outcome by
computing val result = CodecConsensusCaller.isPrimaryFrPair(r1, r2), asserting
that CodecConsensusCaller.isPrimaryFrPair(r2, r1) equals result, and then assert
result against the correct boolean literal (e.g. result shouldBe true) so future
regressions that make this dovetail non-FR on both orderings will fail; update
the test named "CodecConsensusCaller.isPrimaryFrPair should return the same
answer regardless of argument order" accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0735a5f0-50c4-4b42-8e39-aefc1fae4358

📥 Commits

Reviewing files that changed from the base of the PR and between baf6ad3 and d98c1b6.

📒 Files selected for processing (3)
  • src/main/scala/com/fulcrumgenomics/umi/CodecConsensusCaller.scala
  • src/main/scala/com/fulcrumgenomics/umi/UmiConsensusCaller.scala
  • src/test/scala/com/fulcrumgenomics/umi/CodecConsensusCallerTest.scala

Comment on lines +177 to +178
// `scala.MatchError` on a singleton read-name bucket here.
// TODO: handle chimeric reads or reads with one unmapped etc.

Member

Can you link to the HTSJDK PR here, so we know when we can revert to its FrPair code?

Comment on lines +400 to +407
  if (!a.mapped || !b.mapped) false
  else if (a.refIndex != b.refIndex) false
  else if (a.positiveStrand == b.positiveStrand) false
  else {
    val (fwd, rev) = if (a.positiveStrand) (a, b) else (b, a)
    fwd.unclippedStart <= rev.unclippedEnd
  }
}

Member

Could you just find the higher coordinate of the two reads, and then call htsjdk's method here? Your main problem was lack of consistency, and doing anything on the pair gets you there.

@nh13 nh13 force-pushed the nh_fix-codec-matcherror-on-asymmetric-frpair branch from d98c1b6 to de8e5d4 Compare April 18, 2026 07:11
@nh13 nh13 added the bug label Apr 19, 2026
nh13 and others added 3 commits May 19, 2026 15:55
…rPair

CodecConsensusCaller filtered primaries with a per-record `isFrPair` check
then pattern-matched each read-name bucket as `Seq(rec, mate)`. For dovetail
FR pairs whose aligned ends coincide (e.g. soft-clipping producing TLEN=+/-1),
htsjdk's `SamPairUtil.getPairOrientation` returns FR for R1 but RF for R2,
which kept one mate and dropped the other and produced a singleton bucket that
blew up with `scala.MatchError`.

Group primaries by read name first and keep only buckets that form a single
primary FR pair, where the FR check is computed at the pair level from the
unclipped 5' positions of both records and is therefore symmetric in its
arguments. Templates that do not form a clean primary FR pair (singletons,
tandems, RF, cross-contig) are now rejected cleanly with a new
`NotPrimaryFrPair` reason instead of crashing.

Preserve BAM-iteration order of templates when grouping by read name. The
previous `groupBy(_.name).toSeq` produced a template order that depended on
the JVM's `HashMap` layout, which propagated through `filterToMostCommonAlignment`
and caused `maxBy(_.cigar.lengthOnTarget)` to break ties in hash order when
selecting `longestR1Alignment` / `longestR2Alignment`. That could pair
records from different templates, producing spurious `IndelErrorBetweenStrands`
rejections when the resulting overlap geometry did not agree across the
strands. A new `orderedPrimaryPairs` helper groups by read name but orders
templates by first-occurrence in the input.

Includes unit tests for the new helpers and a regression test that reproduces
the exact geometry (68S53M8S / 28S48M53S, aligned ends coinciding) observed in
real CODEC data.

Replace the bespoke unclipped-5'-position comparison with a delegating call to
`SamPairUtil.getPairOrientation` evaluated on the reverse-strand record. That
branch derives both 5' positions from CIGARs (`mate.alignmentStart` vs
`record.alignmentEnd`) and is independent of MC/TLEN, so the answer is exact
and consistent regardless of which record of the pair is passed first. The
per-record asymmetry that motivated this workaround is fixed upstream by
samtools/htsjdk#1771 in htsjdk 5.0.0; once we drop pre-5.0.0 support the
helper can be replaced with `SamRecord.isFrPair`. Inline link added to both
the call-site comment and the helper docstring so future maintainers know
when this can be removed.

Also add a `mateMapped` guard so calling `getPairOrientation` is safe even on
inputs whose mate flags are inconsistent with the records present.
…erting symmetry

The previous test asserted only that `isPrimaryFrPair(r1, r2)` equals
`isPrimaryFrPair(r2, r1)`, which would still pass if both calls returned
`false`. Pin both calls to `true` so a future regression that classifies this
dovetail as non-FR on both orderings still fails the test.
@tfenne tfenne force-pushed the nh_fix-codec-matcherror-on-asymmetric-frpair branch from de8e5d4 to 4bd57cc Compare May 19, 2026 22:17
The previous implementation built a `HashMap` via `groupBy` and then a separate
`HashSet` via `iterator.distinct` to enforce first-occurrence order, requiring
two passes over the input and a per-record hash lookup in each. Replace both
with a single-pass insertion into a `mutable.LinkedHashMap`, which natively
preserves insertion order on iteration and uses just one hash structure.
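A single-pass grouping of this kind might be sketched as follows. `groupInOrder` is a hypothetical generic helper for illustration only; the merged helper operates on SAM records keyed by read name and may differ in detail.

```scala
import scala.collection.mutable

object LinkedHashGroupingDemo {
  // One pass over the input: each element is appended to the buffer for its key,
  // and LinkedHashMap iterates keys in the order they were first inserted.
  def groupInOrder[A, K](xs: Iterator[A])(key: A => K): Seq[(K, Seq[A])] = {
    val groups = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[A]]
    xs.foreach(x => groups.getOrElseUpdate(key(x), mutable.ArrayBuffer.empty[A]) += x)
    groups.iterator.map { case (k, buf) => k -> buf.toList }.toList
  }
}
```

Because the input is consumed as an `Iterator`, this shape also works when the records cannot be traversed twice.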
@tfenne tfenne merged commit 1801b78 into main May 19, 2026
13 checks passed
@tfenne tfenne deleted the nh_fix-codec-matcherror-on-asymmetric-frpair branch May 19, 2026 22:42
2 participants