
fix(volume): use Fault state instead of Schedule state for the Expand() check #4517

Open
davidcheng0922 wants to merge 1 commit into longhorn:master from davidcheng0922:issue-12606-removing-condition-check-volume-expansion

Conversation

@davidcheng0922
Contributor

Which issue(s) this PR fixes:

longhorn/longhorn#12606

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

@coderabbitai

coderabbitai bot commented Feb 5, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Contributor

Copilot AI left a comment


Pull request overview

This pull request improves volume expansion logic by replacing an overly restrictive scheduling check with a more appropriate fault state check.

Changes:

  • Replaces the VolumeConditionTypeScheduled check with a VolumeRobustnessFaulted check in the Expand() function
  • Updates the error message to reflect the new validation logic
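The change described above can be illustrated with a minimal sketch. The types below are simplified, hypothetical stand-ins rather than the actual Longhorn definitions, and `checkExpandable` and `ErrVolumeFaulted` are illustrative names, not the code in this PR. The idea is only that expansion is refused for a faulted volume, while the scheduling condition no longer plays a role.

```go
package main

import (
	"errors"
	"fmt"
)

// VolumeRobustness is a simplified, hypothetical stand-in for the Longhorn type.
type VolumeRobustness string

const (
	VolumeRobustnessHealthy  VolumeRobustness = "healthy"
	VolumeRobustnessDegraded VolumeRobustness = "degraded"
	VolumeRobustnessFaulted  VolumeRobustness = "faulted"
)

// Volume carries only the fields this sketch needs (hypothetical layout).
type Volume struct {
	Name       string
	Robustness VolumeRobustness
	Scheduled  bool // stands in for the VolumeConditionTypeScheduled condition
}

// ErrVolumeFaulted is a hypothetical sentinel error for this sketch.
var ErrVolumeFaulted = errors.New("volume is faulted")

// checkExpandable rejects expansion only when the volume is faulted.
// The Scheduled flag is deliberately ignored.
func checkExpandable(v Volume) error {
	if v.Robustness == VolumeRobustnessFaulted {
		return fmt.Errorf("cannot expand volume %s: %w", v.Name, ErrVolumeFaulted)
	}
	return nil
}

func main() {
	degraded := Volume{Name: "vol-a", Robustness: VolumeRobustnessDegraded, Scheduled: false}
	faulted := Volume{Name: "vol-b", Robustness: VolumeRobustnessFaulted, Scheduled: true}
	fmt.Println(checkExpandable(degraded)) // not scheduled, but not faulted: allowed
	fmt.Println(checkExpandable(faulted))  // faulted: rejected
}
```

Under the previous check, the unscheduled `vol-a` would have been refused even though it is not faulted.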


Contributor

@shuo-wu shuo-wu left a comment


As we discussed, I am worried about race conditions during expansion. For example:

  1. A failed replica gets salvaged/reused, and longhorn-manager may be trying to start the replica process.
  2. An expansion request is received. volume_controller then updates the spec for all the engines and replicas.
  3. The engine process starts handling the expansion, but the replica reused in step 1 may not yet be in the engine mode map, so this replica may miss the expansion.
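The three steps above can be reproduced in a toy model. This is only a sketch: `Replica`, `expandViaEngine`, and `staleReplicas` are hypothetical names, and the model assumes the engine resizes exactly the replicas present in its mode map at the moment the expansion is handled.

```go
package main

import (
	"fmt"
	"sort"
)

// Replica is a hypothetical, minimal replica record.
type Replica struct {
	Name string
	Size int64
}

// expandViaEngine models the engine resizing only the replicas that are
// present in its mode map when the expansion is handled.
func expandViaEngine(modeMap map[string]bool, replicas map[string]*Replica, newSize int64) {
	for name := range modeMap {
		if r, ok := replicas[name]; ok {
			r.Size = newSize
		}
	}
}

// staleReplicas returns the sorted names of replicas smaller than want.
func staleReplicas(replicas map[string]*Replica, want int64) []string {
	var stale []string
	for name, r := range replicas {
		if r.Size < want {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	const oldSize, newSize = 2 << 30, 4 << 30
	replicas := map[string]*Replica{
		"r1": {Name: "r1", Size: oldSize},
		"r2": {Name: "r2", Size: oldSize}, // step 1: reused replica, process still starting
	}
	modeMap := map[string]bool{"r1": true} // r2 is not in the engine mode map yet

	expandViaEngine(modeMap, replicas, newSize) // steps 2 and 3
	fmt.Println("replicas that missed the expansion:", staleReplicas(replicas, newSize))
}
```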

@davidcheng0922
Contributor Author

davidcheng0922 commented Feb 6, 2026


After discussing this in a meeting, two cases are of concern:

  1. The replica is added to the engine mode map before the expansion -> wait for rebuilding to finish, then expand
  2. The replica is added to the engine mode map after the expansion -> adding a smaller replica should fail

An alternative is to block volume expansion in the control plane whenever the volume is not healthy, which avoids any race condition.
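The two cases and the alternative can be written down as a small decision sketch. All names here (`Decision`, `onExpandRequest`, `onReplicaAdd`, `strictGate`) are hypothetical and only restate the options listed in this comment; they are not Longhorn code.

```go
package main

import "fmt"

// Decision is a hypothetical label for what the control plane would do.
type Decision string

const (
	ExpandNow      Decision = "expand now"
	WaitForRebuild Decision = "wait for rebuild, then expand"
	AcceptReplica  Decision = "accept replica"
	RejectReplica  Decision = "reject replica"
	BlockExpansion Decision = "block expansion"
)

// onExpandRequest covers case 1: the replica is already in the engine
// mode map (and rebuilding) when the expansion request arrives.
func onExpandRequest(rebuilding bool) Decision {
	if rebuilding {
		return WaitForRebuild
	}
	return ExpandNow
}

// onReplicaAdd covers case 2: the replica shows up after the expansion,
// so a replica smaller than the volume must not be accepted.
func onReplicaAdd(replicaSize, volumeSize int64) Decision {
	if replicaSize < volumeSize {
		return RejectReplica
	}
	return AcceptReplica
}

// strictGate models the alternative: refuse expansion unless the volume
// is healthy, regardless of what the replicas are doing.
func strictGate(healthy bool) Decision {
	if !healthy {
		return BlockExpansion
	}
	return ExpandNow
}

func main() {
	fmt.Println(onExpandRequest(true))
	fmt.Println(onReplicaAdd(2<<30, 4<<30))
	fmt.Println(strictGate(false))
}
```

The strict gate is the simplest of the three to reason about, at the cost of refusing expansion for degraded volumes that could otherwise be expanded safely.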

Update:

  1. The replica is added to the engine mode map before the expansion -> wait for rebuilding to finish, then expand
     • Both v1 and v2 are fine with expanding the volume during rebuilding
  2. The replica is added to the engine mode map after the expansion -> adding a smaller replica should fail

Triggering a rebuild during expansion works correctly, but the test might simply not be hitting the race condition.

Code Tracing:

  • v1: Before adding the replica to the engine, it checks and expands the replica (code reference)

  • v2: Our implementation adds a base bdev with bdev_raid_grow_base_bdev instead of the SPDK method bdev_raid_add_base_bdev (ref), and it currently allows adding a smaller replica. I tested this directly using go-spdk-helper inside the container:

"raid": {
      "strip_size_kb": 0,
      "state": "online",
      "raid_level": "raid1",
      "num_base_bdevs": 2,
      "num_base_bdevs_discovered": 2,
      "num_base_bdevs_operational": 2,
      "base_bdevs_list": [
        {
          "name": "v1-r-22eb2536n1",
          "uuid": "ce11f3d7-eac8-5fba-96c6-ab709a5c9548",
          "is_configured": true,
          "data_offset": 0,
          "data_size": 4194304
        },
        {
          "name": "disk-1/small-lvol",
          "uuid": "ca95ea36-3be0-498e-a072-a0ffe46d2b85",
          "is_configured": true,
          "data_offset": 0,
          "data_size": 2097152
        }
      ],
      "superblock": false
}

This may be a rare race condition, but the risk of adding a smaller replica is real. We should reject smaller replicas in v2.
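A mismatch like the one in the output above can be detected mechanically. The following is a rough sketch that parses the inner RAID object and reports base bdevs whose `data_size` is below the largest one in the array. The struct and function names are hypothetical, and only the JSON field names are taken from the output shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// baseBdev and raidInfo map only the fields this sketch needs.
type baseBdev struct {
	Name     string `json:"name"`
	DataSize uint64 `json:"data_size"`
}

type raidInfo struct {
	RaidLevel string     `json:"raid_level"`
	BaseBdevs []baseBdev `json:"base_bdevs_list"`
}

// undersized returns the names of base bdevs whose data_size is smaller
// than the largest data_size found in the array.
func undersized(raw []byte) ([]string, error) {
	var info raidInfo
	if err := json.Unmarshal(raw, &info); err != nil {
		return nil, err
	}
	var largest uint64
	for _, b := range info.BaseBdevs {
		if b.DataSize > largest {
			largest = b.DataSize
		}
	}
	var names []string
	for _, b := range info.BaseBdevs {
		if b.DataSize < largest {
			names = append(names, b.Name)
		}
	}
	return names, nil
}

func main() {
	// Trimmed copy of the RAID object from the test above.
	raw := []byte(`{
	  "raid_level": "raid1",
	  "base_bdevs_list": [
	    {"name": "v1-r-22eb2536n1", "data_size": 4194304},
	    {"name": "disk-1/small-lvol", "data_size": 2097152}
	  ]
	}`)
	names, err := undersized(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("undersized base bdevs:", names)
}
```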

Update after tracing the SPDK code:

In SPDK RAID:

  • add_base_bdev does not increase the number of base devices.
    It only attaches a bdev into an existing empty slot; if no empty slot exists, the operation fails.

  • grow_base_bdev is the only API that expands the RAID to include more base devices.
    It updates RAID metadata, increases slot count, adjusts capacity, and requires module-level grow support.

So using grow_base_bdev is correct; we only need to block bdevs of a smaller size.
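The difference between the two operations, and the proposed guard, can be sketched with a toy slot model. This is not SPDK code: `raid`, `addBase`, and `growBase` are hypothetical names that only mirror the behaviour described above, and the size check in `growBase` is the proposed fix, not existing behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// bdev is a hypothetical base device with a name and a size.
type bdev struct {
	name string
	size uint64
}

// raid is a toy RAID1 model: a list of slots plus the data size that
// every member is expected to provide.
type raid struct {
	slots    []*bdev // a nil entry marks an empty slot
	dataSize uint64
}

var (
	errNoEmptySlot = errors.New("no empty slot")
	errTooSmall    = errors.New("base bdev smaller than raid data size")
)

// addBase mirrors the described add_base_bdev behaviour: it fills an
// existing empty slot and fails when there is none.
func (r *raid) addBase(b *bdev) error {
	for i, s := range r.slots {
		if s == nil {
			r.slots[i] = b
			return nil
		}
	}
	return errNoEmptySlot
}

// growBase mirrors the described grow_base_bdev behaviour by adding a new
// slot, with the proposed guard that rejects an undersized bdev first.
func (r *raid) growBase(b *bdev) error {
	if b.size < r.dataSize {
		return fmt.Errorf("%w: %s has %d, need %d", errTooSmall, b.name, b.size, r.dataSize)
	}
	r.slots = append(r.slots, b)
	return nil
}

func main() {
	r := &raid{
		slots:    []*bdev{{name: "v1-r-22eb2536n1", size: 4194304}},
		dataSize: 4194304,
	}
	fmt.Println(r.addBase(&bdev{name: "new-replica", size: 4194304}))        // no empty slot
	fmt.Println(r.growBase(&bdev{name: "disk-1/small-lvol", size: 2097152})) // rejected by the guard
	fmt.Println(r.growBase(&bdev{name: "new-replica", size: 4194304}), len(r.slots))
}
```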

cc @derekbit @shuo-wu @COLDTURNIP

@derekbit derekbit force-pushed the issue-12606-removing-condition-check-volume-expansion branch from eb1d757 to dcd11e9 Compare February 23, 2026 01:47
@derekbit derekbit force-pushed the issue-12606-removing-condition-check-volume-expansion branch from dcd11e9 to 8ebe665 Compare April 3, 2026 02:55
@davidcheng0922 davidcheng0922 force-pushed the issue-12606-removing-condition-check-volume-expansion branch from 8ebe665 to 728c534 Compare April 13, 2026 05:28
…pand() check

Signed-off-by: David Cheng <david.cheng@suse.com>
@davidcheng0922 davidcheng0922 force-pushed the issue-12606-removing-condition-check-volume-expansion branch from 728c534 to 96f9191 Compare April 13, 2026 08:46