feat!: allow non-terminal + segment in read structures by tfenne · Pull Request #1157 · fulcrumgenomics/fgbio

tfenne · 2026-04-23T22:34:26Z

The + (indefinite-length) segment can now appear at any position within a read structure, not only the last, mirroring the Rust read-structure crate PR #14 on the Scala side. Extract resolves middle- and leading-+ structures by walking backward from the end of the read for post-+ segments.

Breaking API changes:

ReadSegment.offset removed; ReadStructure tracks offsets internally via a private signed array. Constructors now take (length, kind) instead of (offset, length, kind).
ReadStructure.apply(segments, resetOffsets) simplified to apply(segments); offsets are always computed.
Fixed-length structures reject over-long reads instead of silently truncating. Callers that may see variable-length input should route it through withVariableLastSegment.
extract gained an includeSkips: Boolean = false parameter; Skip segments are now omitted from extract results by default. All in-tree callers were already filtering by kind, so no behavior change for them.

ExtractUmisFromBam.updateClippingInformation now walks segments accumulating a running non-template-length sum before the clip position. It requires all but the last segment to be fixed-length, since translating a clip past a non-terminal + would need the read length.

The + segment still means zero-or-more bases here, deliberately differing from the Rust crate's one-or-more; DemuxFastqs relies on zero-or-more to handle short/trimmed reads via
withVariableLastSegment.

The `+` (indefinite-length) segment can now appear at any position within a read structure, not only the last, mirroring the Rust `read-structure` crate PR #14 on the Scala side. Extract resolves middle- and leading-+ structures by walking backward from the end of the read for post-+ segments. Breaking API changes: - ReadSegment.offset removed; ReadStructure tracks offsets internally via a private signed array. Constructors now take (length, kind) instead of (offset, length, kind). - ReadStructure.apply(segments, resetOffsets) simplified to apply(segments); offsets are always computed. - Fixed-length structures reject over-long reads instead of silently truncating. Callers that may see variable-length input should route it through withVariableLastSegment. - extract gained an includeSkips: Boolean = false parameter; Skip segments are now omitted from extract results by default. All in-tree callers were already filtering by kind, so no behavior change for them. ExtractUmisFromBam.updateClippingInformation now walks segments accumulating a running non-template-length sum before the clip position. It requires all but the last segment to be fixed-length, since translating a clip past a non-terminal + would need the read length. The + segment still means zero-or-more bases here, deliberately differing from the Rust crate's one-or-more; DemuxFastqs relies on zero-or-more to handle short/trimmed reads via withVariableLastSegment.

codecov · 2026-04-23T22:39:08Z

Codecov Report

❌ Patch coverage is 98.91304% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.97%. Comparing base (768d38c) to head (f0fed8e).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...scala/com/fulcrumgenomics/util/ReadStructure.scala	98.68%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1157      +/-   ##
==========================================
+ Coverage   95.90%   95.97%   +0.06%     
==========================================
  Files         133      133              
  Lines        8327     8390      +63     
  Branches      942     1015      +73     
==========================================
+ Hits         7986     8052      +66     
+ Misses        341      338       -3

Flag	Coverage Δ
unittests	`95.97% <98.91%> (+0.06%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-04-23T22:40:26Z

PR Preview Action v1.6.1
🚀 View preview at https://fulcrumgenomics.github.io/fgbio/pr-preview/pr-1157/
Built to branch `gh-pages` at 2026-05-19 21:17 UTC. Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai · 2026-04-23T22:41:17Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d12a2a7b-82c3-4027-a009-0327deda8c2a

📥 Commits

Reviewing files that changed from the base of the PR and between a3c7fbe and f0fed8e.

📒 Files selected for processing (3)

src/main/scala/com/fulcrumgenomics/fastq/DemuxFastqs.scala
src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala
src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala

✅ Files skipped from review due to trivial changes (2)

src/main/scala/com/fulcrumgenomics/fastq/DemuxFastqs.scala
src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala

📝 Walkthrough

Walkthrough

This pull request revises read-structure handling and docs: CLI help now allows at most one '+' segment anywhere; ReadSegment no longer stores offsets and ReadStructure computes per-segment offsets and supports non-terminal '+'; extract(...) gains includeSkips and omits Skip by default; ExtractUmisFromBam recalculates clipping positions against the new segment model; tests expanded to cover boundary and '+' cases.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: allowing non-terminal + segments in read structures, matching the primary objective of the PR.
Description check	✅ Passed	The description comprehensively covers the changeset, explaining the feature, breaking API changes, and implementation details related to the modified files.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch tf_non_terminal_plus_in_rs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala (1)
63-106: Good coverage of the new +-at-any-position rule: multi-+ rejection via both strings and sequences, round-trip through toString, and non-positive-length rejection.

Might also be worth asserting that a non-adjacent multi-+ (e.g. ReadStructure("8B+M10T+S")) rejects, to guard the "at most one" invariant across the whole structure rather than just adjacent cases. Optional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala` around lines
63 - 106, Add a test asserting that non-adjacent multiple '+' segments are
rejected to ensure the "at most one + anywhere" invariant; specifically, update
ReadStructureTest (tests around the "accept the + segment at any position, at
most once" block) to include an assertion that ReadStructure("8B+M10T+S") (and
similar non-adjacent multi-+ strings) throws an Exception using the same
an[Exception] shouldBe thrownBy pattern so the parser rejects multiple '+' even
when they are not adjacent.
src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala (1)
430-449: Good boundary coverage — especially the "clip exactly on a segment boundary" cases, which are where the offset < clippingPosition loop guard matters most.

All four expected values check out against the new implementation. Optional: consider also asserting that a non-terminal + with clippingAttribute throws the new require (e.g. ReadStructure("10M+M5T") with XT set), to lock in the precondition.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala` around
lines 430 - 449, Add a negative test asserting that the new precondition (the
require) rejects a non-terminal '+' clipping operator when a clippingAttribute
is present: in ExtractUmisFromBamTest.scala add a case that sets record("XT") to
some value and calls ExtractUmisFromBam.updateClippingInformation(record,
Some("XT"), ReadStructure("10M+M5T")) (or similar with a non-terminal '+') and
assert that the call throws the expected exception; reference
ExtractUmisFromBam.updateClippingInformation, ReadStructure and the clipping
attribute name "XT" when locating where to add the test.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala`:
- Around line 58-62: The wording uses the jargon "monotemplate" in the paragraph
describing the read-structure example (which shows `5M5S+T`), but that term is
confusing because the example's second five bases are a skip (`5S`); update the
sentence in FastqToBam.scala (the paragraph around the `5M5S+T` example) to say
the second five bases "should be skipped" or "are non-template/skipped" instead
of "monotemplate" so the description matches the `5S` operation.

In `@src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala`:
- Around line 253-301: Defaulting includeSkips to false in the extract overloads
(see extract(bases: String): Seq[SubReadWithoutQuals] and extract(bases: String,
quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals]) is a
silent breaking change for out-of-tree callers who relied on Skip segments;
update the release notes/changelog to call this out as a breaking change and
also update the Scaladoc for both extract methods to explicitly document the new
default behavior and recommend callers pass includeSkips = true if they relied
on Skip entries.

---

Nitpick comments:
In `@src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala`:
- Around line 430-449: Add a negative test asserting that the new precondition
(the require) rejects a non-terminal '+' clipping operator when a
clippingAttribute is present: in ExtractUmisFromBamTest.scala add a case that
sets record("XT") to some value and calls
ExtractUmisFromBam.updateClippingInformation(record, Some("XT"),
ReadStructure("10M+M5T")) (or similar with a non-terminal '+') and assert that
the call throws the expected exception; reference
ExtractUmisFromBam.updateClippingInformation, ReadStructure and the clipping
attribute name "XT" when locating where to add the test.

In `@src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala`:
- Around line 63-106: Add a test asserting that non-adjacent multiple '+'
segments are rejected to ensure the "at most one + anywhere" invariant;
specifically, update ReadStructureTest (tests around the "accept the + segment
at any position, at most once" block) to include an assertion that
ReadStructure("8B+M10T+S") (and similar non-adjacent multi-+ strings) throws an
Exception using the same an[Exception] shouldBe thrownBy pattern so the parser
rejects multiple '+' even when they are not adjacent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9a3c5f08-44b3-4474-9c6f-85f3f3bfff9b

📥 Commits

Reviewing files that changed from the base of the PR and between 768d38c and a3c7fbe.

📒 Files selected for processing (7)

src/main/scala/com/fulcrumgenomics/fastq/DemuxFastqs.scala
src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala
src/main/scala/com/fulcrumgenomics/illumina/RunInfo.scala
src/main/scala/com/fulcrumgenomics/umi/ExtractUmisFromBam.scala
src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala
src/test/scala/com/fulcrumgenomics/umi/ExtractUmisFromBamTest.scala
src/test/scala/com/fulcrumgenomics/util/ReadStructureTest.scala

coderabbitai · 2026-04-23T22:41:20Z

+  /** Extracts the bases corresponding to each read segment.  [[SegmentType.Skip]] segments are omitted
+    * from the returned sequence because callers almost always throw those bases away; use the two-argument
+    * overload with `includeSkips = true` to keep them. */
+  def extract(bases: String): Seq[SubReadWithoutQuals] = extract(bases, includeSkips = false)
+
+  /** Extracts the bases corresponding to each read segment.
+    *
+    * @param bases the raw read bases
+    * @param includeSkips whether to include [[SegmentType.Skip]] segments in the returned sequence
+    */
+  def extract(bases: String, includeSkips: Boolean): Seq[SubReadWithoutQuals] = {
+    validateReadLength(bases.length)
+    val readLen = bases.length
+    val builder = IndexedSeq.newBuilder[SubReadWithoutQuals]
+    var i       = 0
+    while (i < segments.length) {
+      val seg = segments(i)
+      if (includeSkips || seg.kind != SegmentType.Skip) {
+        val (start, end) = spanOf(i, readLen)
+        builder += SubReadWithoutQuals(bases.substring(start, end), seg)
+      }
+      i += 1
+    }
+    builder.result()
+  }
+
+  /** Extracts the bases and qualities corresponding to each read segment.
+    *
+    * @param bases the raw read bases
+    * @param quals the raw read qualities; must be the same length as `bases`
+    * @param includeSkips whether to include [[SegmentType.Skip]] segments in the returned sequence;
+    *                     defaults to `false` because callers almost always throw those bases away.
+    */
+  def extract(bases: String, quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals] = {
+    require(bases.length == quals.length, s"Bases and quals differ in length: ${bases.length} vs ${quals.length}")
+    validateReadLength(bases.length)
+    val readLen = bases.length
+    val builder = IndexedSeq.newBuilder[SubReadWithQuals]
+    var i       = 0
+    while (i < segments.length) {
+      val seg = segments(i)
+      if (includeSkips || seg.kind != SegmentType.Skip) {
+        val (start, end) = spanOf(i, readLen)
+        builder += SubReadWithQuals(bases.substring(start, end), quals.substring(start, end), seg)
+      }
+      i += 1
+    }
+    builder.result()
  }


⚠️ Potential issue | 🟡 Minor

Skip-exclusion default is a silent semantic change — worth a release-notes mention.

In-tree callers (DemuxFastqs, FastqToBam, ExtractUmisFromBam) all post-filter by kind, so this is safe here. Out-of-tree callers that previously iterated extract(bases) without filtering will silently lose Skip entries. Given feat!, that's acceptable — just call it out in the changelog alongside the other breaking items.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/main/scala/com/fulcrumgenomics/util/ReadStructure.scala` around lines 253 - 301, Defaulting includeSkips to false in the extract overloads (see extract(bases: String): Seq[SubReadWithoutQuals] and extract(bases: String, quals: String, includeSkips: Boolean = false): Seq[SubReadWithQuals]) is a silent breaking change for out-of-tree callers who relied on Skip segments; update the release notes/changelog to call this out as a breaking change and also update the Scaladoc for both extract methods to explicitly document the new default behavior and recommend callers pass includeSkips = true if they relied on Skip entries.

clintval

Why does the Scala implementation allow a zero-length + operator but the Rust implementation does not? Would you be willing to update our wiki docs on read structures here too?

https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

clintval · 2026-05-19T15:46:30Z

+      |molecular identifiers will be concatenated using
      |the `-` delimiter and placed in the given SAM record tag (`RX` by default).  Similarly, the sample barcode bases


Line lengths are a bit broken here.

Should render fine, but looks odd in the source code.

clintval · 2026-05-19T15:50:26Z

+    // molecular-barcode prefix that gets stripped: a clip past the 10M should shift down by 10
+    record("XT") = 50
+    ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("10M90T"))
+    record[Int]("XT") shouldBe 40
+
+    // clip position exactly on a segment boundary (end of the skip)
+    record("XT") = 30
+    ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("20T10S70T"))
+    record[Int]("XT") shouldBe 20
+
+    // clip position exactly on the next segment's start when that segment is a template
+    record("XT") = 20
+    ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("10M10T80T"))
+    record[Int]("XT") shouldBe 10
+
+    // multiple template segments with a mol barcode between them; clip lands in the second template
+    record("XT") = 40
+    ExtractUmisFromBam.updateClippingInformation(record, Some("XT"), ReadStructure("20T10M70T"))
+    record[Int]("XT") shouldBe 30


Decent tests, but did you mean to add any read segments with the + operator before/after the UMI segment to flex the new functionality?

I've added a positive test for a trailing +T, but the tool actually explicitly checks for and forbids a + segment in the non terminal position (separate and apart from the read-structure parsing), so I've added a test that that check still fires.

Throughout this PR I've avoided changing behavior other than the RS parsing. I think we could entertain separate PRs to loosen restrictions in tools where it makes sense.

tfenne · 2026-05-19T21:07:11Z

Why does the Scala implementation allow a zero-length + operator but the Rust implementation does not? Would you be willing to update our wiki docs on read structures here too?

https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

Honestly, I think it should be that + is one or more, but there is behavior in DemuxFastqs that relies on it being 0-or-more. I didn't want to break that, so left it be for now. Individual tools can always assert that each segment has a length post-extraction.

I will update the wiki docs once this PR is merged.

Add a positive regression case for a terminal + segment with a clipping attribute, and a negative case asserting the non-terminal + precondition fires. Also reflow the read-structure paragraph in DemuxFastqs so the source no longer wraps mid-clause. Both per PR #1157 review feedback.

tfenne requested review from clintval and nh13 as code owners April 23, 2026 22:34

coderabbitai Bot requested changes Apr 23, 2026

View reviewed changes

docs: clarify FastqToBam read structure example

92d5e90

clintval approved these changes May 19, 2026

View reviewed changes

clintval assigned tfenne May 19, 2026

tfenne merged commit 21470e2 into main May 19, 2026
13 checks passed

tfenne deleted the tf_non_terminal_plus_in_rs branch May 19, 2026 21:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat!: allow non-terminal + segment in read structures#1157

feat!: allow non-terminal + segment in read structures#1157
tfenne merged 3 commits into
mainfrom
tf_non_terminal_plus_in_rs

tfenne commented Apr 23, 2026

Uh oh!

codecov Bot commented Apr 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 23, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-05-19 21:17 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 23, 2026

Uh oh!

clintval left a comment

Uh oh!

clintval May 19, 2026

Uh oh!

clintval May 19, 2026

Uh oh!

tfenne May 19, 2026

Uh oh!

tfenne commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		\|molecular identifiers will be concatenated using
		\|the `-` delimiter and placed in the given SAM record tag (`RX` by default). Similarly, the sample barcode bases

Uh oh!

Conversation

tfenne commented Apr 23, 2026

Uh oh!

codecov Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-05-19 21:17 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

clintval left a comment

Choose a reason for hiding this comment

Uh oh!

clintval May 19, 2026

Choose a reason for hiding this comment

Uh oh!

clintval May 19, 2026

Choose a reason for hiding this comment

Uh oh!

tfenne May 19, 2026

Choose a reason for hiding this comment

Uh oh!

tfenne commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Apr 23, 2026 •

edited

Loading

github-actions Bot commented Apr 23, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-05-19 21:17 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading