feat!: allow non-terminal + segment in read structures#1157
Conversation
The `+` (indefinite-length) segment can now appear at any position within a read structure, not only the last, mirroring the Rust `read-structure` crate PR #14 on the Scala side. Extract resolves middle- and leading-+ structures by walking backward from the end of the read for post-+ segments. Breaking API changes: - ReadSegment.offset removed; ReadStructure tracks offsets internally via a private signed array. Constructors now take (length, kind) instead of (offset, length, kind). - ReadStructure.apply(segments, resetOffsets) simplified to apply(segments); offsets are always computed. - Fixed-length structures reject over-long reads instead of silently truncating. Callers that may see variable-length input should route it through withVariableLastSegment. - extract gained an includeSkips: Boolean = false parameter; Skip segments are now omitted from extract results by default. All in-tree callers were already filtering by kind, so no behavior change for them. ExtractUmisFromBam.updateClippingInformation now walks segments accumulating a running non-template-length sum before the clip position. It requires all but the last segment to be fixed-length, since translating a clip past a non-terminal + would need the read length. The + segment still means zero-or-more bases here, deliberately differing from the Rust crate's one-or-more; DemuxFastqs relies on zero-or-more to handle short/trimmed reads via withVariableLastSegment.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1157 +/- ##
==========================================
+ Coverage 95.90% 95.97% +0.06%
==========================================
Files 133 133
Lines 8327 8390 +63
Branches 942 1015 +73
==========================================
+ Hits 7986 8052 +66
+ Misses 341 338 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (2)
📝 WalkthroughWalkthroughThis pull request revises read-structure handling and docs: CLI help now allows at most one '+' segment anywhere; ReadSegment no longer stores offsets and ReadStructure computes per-segment offsets and supports non-terminal '+'; extract(...) gains includeSkips and omits Skip by default; ExtractUmisFromBam recalculates clipping positions against the new segment model; tests expanded to cover boundary and '+' cases. 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala (1)
63-106: Good coverage of the new+-at-any-position rule: multi-+rejection via both strings and sequences, round-trip throughtoString, and non-positive-length rejection.Might also be worth asserting that a non-adjacent multi-
+(e.g.ReadStructure("8B+M10T+S")) rejects, to guard the "at most one" invariant across the whole structure rather than just adjacent cases. Optional.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala` around lines 63 - 106, Add a test asserting that non-adjacent multiple '+' segments are rejected to ensure the "at most one + anywhere" invariant; specifically, update ReadStructureTest (tests around the "accept the + segment at any position, at most once" block) to include an assertion that ReadStructure("8B+M10T+S") (and similar non-adjacent multi-+ strings) throws an Exception using the same an[Exception] shouldBe thrownBy pattern so the parser rejects multiple '+' even when they are not adjacent.src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala (1)
430-449: Good boundary coverage — especially the "clip exactly on a segment boundary" cases, which are where theoffset < clippingPositionloop guard matters most.All four expected values check out against the new implementation. Optional: consider also asserting that a non-terminal
+withclippingAttributethrows the newrequire(e.g.ReadStructure("10M+M5T")withXTset), to lock in the precondition.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala` around lines 430 - 449, Add a negative test asserting that the new precondition (the require) rejects a non-terminal '+' clipping operator when a clippingAttribute is present: in ExtractUmisFromBamTest.scala add a case that sets record("XT") to some value and calls ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("10M+M5T")) (or similar with a non-terminal '+') and assert that the call throws the expected exception; reference ExtractUmisFromBam.updateClippingInformation, ReadStructure and the clipping attribute name "XT" when locating where to add the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala`:
- Around line 58-62: The wording uses the jargon "monotemplate" in the paragraph
describing the read-structure example (which shows `5M5S+T`), but that term is
confusing because the example's second five bases are a skip (`5S`); update the
sentence in FastqToBam.scala (the paragraph around the `5M5S+T` example) to say
the second five bases "should be skipped" or "are non-template/skipped" instead
of "monotemplate" so the description matches the `5S` operation.
In `@src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala`:
- Around line 253-301: Defaulting includeSkips to false in the extract overloads
(see extract(bases: String): Seq[SubReadWithoutQuals] and extract(bases: String,
quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals]) is a
silent breaking change for out-of-tree callers who relied on Skip segments;
update the release notes/changelog to call this out as a breaking change and
also update the Scaladoc for both extract methods to explicitly document the new
default behavior and recommend callers pass includeSkips = true if they relied
on Skip entries.
---
Nitpick comments:
In `@src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala`:
- Around line 430-449: Add a negative test asserting that the new precondition
(the require) rejects a non-terminal '+' clipping operator when a
clippingAttribute is present: in ExtractUmisFromBamTest.scala add a case that
sets record("XT") to some value and calls
ExtractUmisFromBam.updateClippingInformation(record, Some("XT"),
ReadStructure("10M+M5T")) (or similar with a non-terminal '+') and assert that
the call throws the expected exception; reference
ExtractUmisFromBam.updateClippingInformation, ReadStructure and the clipping
attribute name "XT" when locating where to add the test.
In `@src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala`:
- Around line 63-106: Add a test asserting that non-adjacent multiple '+'
segments are rejected to ensure the "at most one + anywhere" invariant;
specifically, update ReadStructureTest (tests around the "accept the + segment
at any position, at most once" block) to include an assertion that
ReadStructure("8B+M10T+S") (and similar non-adjacent multi-+ strings) throws an
Exception using the same an[Exception] shouldBe thrownBy pattern so the parser
rejects multiple '+' even when they are not adjacent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9a3c5f08-44b3-4474-9c6f-85f3f3bfff9b
📒 Files selected for processing (7)
src/main/scala/com/fulcrumgenomics/fastq/DemuxFastqs.scalasrc/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scalasrc/main/scala/com/fulcrumgenomics/illumina/RunInfo.scalasrc/main/scala/com/fulcrumgenomics/umi/ExtractUmisFromBam.scalasrc/main/scala/com/fulcrumgenomics/util/ReadStructure.scalasrc/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scalasrc/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala
| /** Extracts the bases corresponding to each read segment. [[SegmentType.Skip]] segments are omitted | ||
| * from the returned sequence because callers almost always throw those bases away; use the two-argument | ||
| * overload with `includeSkips = true` to keep them. */ | ||
| def extract(bases: String): Seq[SubReadWithoutQuals] = extract(bases, includeSkips = false) | ||
|
|
||
| /** Extracts the bases corresponding to each read segment. | ||
| * | ||
| * @param bases the raw read bases | ||
| * @param includeSkips whether to include [[SegmentType.Skip]] segments in the returned sequence | ||
| */ | ||
| def extract(bases: String, includeSkips: Boolean): Seq[SubReadWithoutQuals] = { | ||
| validateReadLength(bases.length) | ||
| val readLen = bases.length | ||
| val builder = IndexedSeq.newBuilder[SubReadWithoutQuals] | ||
| var i = 0 | ||
| while (i < segments.length) { | ||
| val seg = segments(i) | ||
| if (includeSkips || seg.kind != SegmentType.Skip) { | ||
| val (start, end) = spanOf(i, readLen) | ||
| builder += SubReadWithoutQuals(bases.substring(start, end), seg) | ||
| } | ||
| i += 1 | ||
| } | ||
| builder.result() | ||
| } | ||
|
|
||
| /** Extracts the bases and qualities corresponding to each read segment. | ||
| * | ||
| * @param bases the raw read bases | ||
| * @param quals the raw read qualities; must be the same length as `bases` | ||
| * @param includeSkips whether to include [[SegmentType.Skip]] segments in the returned sequence; | ||
| * defaults to `false` because callers almost always throw those bases away. | ||
| */ | ||
| def extract(bases: String, quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals] = { | ||
| require(bases.length == quals.length, s"Bases and quals differ in length: ${bases.length} vs ${quals.length}") | ||
| validateReadLength(bases.length) | ||
| val readLen = bases.length | ||
| val builder = IndexedSeq.newBuilder[SubReadWithQuals] | ||
| var i = 0 | ||
| while (i < segments.length) { | ||
| val seg = segments(i) | ||
| if (includeSkips || seg.kind != SegmentType.Skip) { | ||
| val (start, end) = spanOf(i, readLen) | ||
| builder += SubReadWithQuals(bases.substring(start, end), quals.substring(start, end), seg) | ||
| } | ||
| i += 1 | ||
| } | ||
| builder.result() | ||
| } |
There was a problem hiding this comment.
Skip-exclusion default is a silent semantic change — worth a release-notes mention.
In-tree callers (DemuxFastqs, FastqToBam, ExtractUmisFromBam) all post-filter by kind, so this is safe here. Out-of-tree callers that previously iterated extract(bases) without filtering will silently lose Skip entries. Given feat!, that's acceptable — just call it out in the changelog alongside the other breaking items.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala` around lines 253
- 301, Defaulting includeSkips to false in the extract overloads (see
extract(bases: String): Seq[SubReadWithoutQuals] and extract(bases: String,
quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals]) is a
silent breaking change for out-of-tree callers who relied on Skip segments;
update the release notes/changelog to call this out as a breaking change and
also update the Scaladoc for both extract methods to explicitly document the new
default behavior and recommend callers pass includeSkips = true if they relied
on Skip entries.
clintval
left a comment
There was a problem hiding this comment.
Why does the Scala implementation allow a zero-length + operator but the Rust implementation does not? Would you be willing to update our wiki docs on read structures here too?
https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
| |molecular identifiers will be concatenated using | ||
| |the `-` delimiter and placed in the given SAM record tag (`RX` by default). Similarly, the sample barcode bases |
There was a problem hiding this comment.
Line lengths are a bit broken here.
Should render fine, but looks odd in the source code.
| // molecular-barcode prefix that gets stripped: a clip past the 10M should shift down by 10 | ||
| record("XT") = 50 | ||
| ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("10M90T")) | ||
| record[Int]("XT") shouldBe 40 | ||
|
|
||
| // clip position exactly on a segment boundary (end of the skip) | ||
| record("XT") = 30 | ||
| ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("20T10S70T")) | ||
| record[Int]("XT") shouldBe 20 | ||
|
|
||
| // clip position exactly on the next segment's start when that segment is a template | ||
| record("XT") = 20 | ||
| ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("10M10T80T")) | ||
| record[Int]("XT") shouldBe 10 | ||
|
|
||
| // multiple template segments with a mol barcode between them; clip lands in the second template | ||
| record("XT") = 40 | ||
| ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("20T10M70T")) | ||
| record[Int]("XT") shouldBe 30 |
There was a problem hiding this comment.
Decent tests, but did you mean to add any read segments with the + operator before/after the UMI segment to flex the new functionality?
There was a problem hiding this comment.
I've added a positive test for a trailing +T, but the tool actually explicitly checks for and forbids a + segment in the non terminal position (separate and apart from the read-structure parsing), so I've added a test that that check still fires.
Throughout this PR I've avoided changing behavior other than the RS parsing. I think we could entertain separate PRs to loosen restrictions in tools where it makes sense.
Honestly, I think it should be that I will update the wiki docs once this PR is merged. |
Add a positive regression case for a terminal + segment with a clipping attribute, and a negative case asserting the non-terminal + precondition fires. Also reflow the read-structure paragraph in DemuxFastqs so the source no longer wraps mid-clause. Both per PR #1157 review feedback.
The
+(indefinite-length) segment can now appear at any position within a read structure, not only the last, mirroring the Rustread-structurecrate PR #14 on the Scala side. Extract resolves middle- and leading-+ structures by walking backward from the end of the read for post-+ segments.Breaking API changes:
ExtractUmisFromBam.updateClippingInformation now walks segments accumulating a running non-template-length sum before the clip position. It requires all but the last segment to be fixed-length, since translating a clip past a non-terminal + would need the read length.
The + segment still means zero-or-more bases here, deliberately differing from the Rust crate's one-or-more; DemuxFastqs relies on zero-or-more to handle short/trimmed reads via
withVariableLastSegment.