fix(volume): use Fault state instead of Schedule state for the Expand() check #4517
Conversation
Pull request overview
This pull request improves volume expansion logic by replacing an overly restrictive scheduling check with a more appropriate fault state check.
Changes:
- Replaces the `VolumeConditionTypeScheduled` check with a `VolumeRobustnessFaulted` check in the `Expand()` function
- Updates the error message to reflect the new validation logic
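As a rough sketch of the new guard (simplified stand-in types for illustration; the real check lives in longhorn-manager's `Expand()` path), expansion is now rejected only when the volume is faulted, rather than whenever it is unscheduled:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for longhorn-manager types; the real
// definitions live in the Longhorn API packages.
type VolumeRobustness string

const VolumeRobustnessFaulted VolumeRobustness = "faulted"

type Volume struct {
	Name       string
	Robustness VolumeRobustness
}

// checkVolumeExpandable mirrors the new guard: reject expansion only for a
// faulted volume, instead of for any unscheduled one.
func checkVolumeExpandable(v *Volume) error {
	if v.Robustness == VolumeRobustnessFaulted {
		return errors.New("volume " + v.Name + " is faulted and cannot be expanded")
	}
	return nil
}

func main() {
	healthy := &Volume{Name: "vol-1", Robustness: "healthy"}
	faulted := &Volume{Name: "vol-2", Robustness: VolumeRobustnessFaulted}
	fmt.Println(checkVolumeExpandable(healthy) == nil) // expansion allowed
	fmt.Println(checkVolumeExpandable(faulted) != nil) // expansion denied
}
```

This relaxes the old behavior, where a merely unscheduled (but otherwise usable) volume could not be expanded at all.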
shuo-wu
left a comment
As we discussed, I am worried about some race issues during the expansion. For example:
- A failed replica gets salvaged/reused, and maybe longhorn-manager is trying to start the replica process.
- An expansion request is received. Then volume_controller will update the spec for all the engine and replicas.
- The engine process will start to handle the expansion, but the reused replica from step 1 may not be in the engine's replica mode map yet ==> this replica may miss the expansion.
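The missed-expansion race described above can be sketched with a toy model (all names are illustrative, not Longhorn's actual engine API): the engine expands only the replicas present in its mode map, so a replica salvaged concurrently, before it is registered, keeps its old size.

```go
package main

import "fmt"

// expandReplicas models the engine applying the new size only to replicas
// it currently knows about (those registered in its replica-mode map).
func expandReplicas(modeMap map[string]string, sizes map[string]int64, newSize int64) {
	for r := range modeMap {
		sizes[r] = newSize
	}
}

func main() {
	sizes := map[string]int64{"r-1": 10, "r-2": 10, "r-salvaged": 10}
	// r-salvaged was reused after a failure and is not yet in the mode map
	// when the expansion request arrives.
	modeMap := map[string]string{"r-1": "RW", "r-2": "RW"}
	expandReplicas(modeMap, sizes, 20)
	fmt.Println(sizes["r-1"])        // 20: expanded
	fmt.Println(sizes["r-salvaged"]) // 10: missed the expansion
}
```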
After the meeting discussion, two cases remain a concern:
Another method: simply block the volume expansion in the control plane when the volume is not healthy, to avoid any race condition.

Update: it works correctly for triggering rebuilding during expansion, but it might simply not be hitting the race condition.

Code tracing:
```json
"raid": {
    "strip_size_kb": 0,
    "state": "online",
    "raid_level": "raid1",
    "num_base_bdevs": 2,
    "num_base_bdevs_discovered": 2,
    "num_base_bdevs_operational": 2,
    "base_bdevs_list": [
        {
            "name": "v1-r-22eb2536n1",
            "uuid": "ce11f3d7-eac8-5fba-96c6-ab709a5c9548",
            "is_configured": true,
            "data_offset": 0,
            "data_size": 4194304
        },
        {
            "name": "disk-1/small-lvol",
            "uuid": "ca95ea36-3be0-498e-a072-a0ffe46d2b85",
            "is_configured": true,
            "data_offset": 0,
            "data_size": 2097152
        }
    ],
    "superblock": false
}
```

It might be a rare race condition, but the risk of adding a smaller replica is real. We should deny the smaller replica in v2.

Update after SPDK code trace, in SPDK RAID:
So, we use
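A minimal sketch of the proposed v2 guard, assuming we can compare a candidate base bdev's `data_size` against the existing members' size before adding it to the RAID (the function name and signature are hypothetical, not SPDK's or Longhorn's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// validateBaseBdevSize rejects any candidate base bdev whose data_size is
// smaller than the RAID's existing members', so a stale (pre-expansion)
// replica cannot be added and shrink the array.
func validateBaseBdevSize(existingDataSize, candidateDataSize uint64) error {
	if candidateDataSize < existingDataSize {
		return errors.New("candidate base bdev is smaller than the RAID data size")
	}
	return nil
}

func main() {
	// Sizes taken from the bdev_raid dump above (in blocks).
	fmt.Println(validateBaseBdevSize(4194304, 2097152)) // the small lvol is denied
	fmt.Println(validateBaseBdevSize(4194304, 4194304)) // an equal-size replica passes
}
```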
…pand() check Signed-off-by: David Cheng <david.cheng@suse.com>
Which issue(s) this PR fixes:
longhorn/longhorn#12606
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context