fix(volume): use Fault state instead of Schedule state for the Expand() check #4517
Conversation
Pull request overview
This pull request improves volume expansion logic by replacing an overly restrictive scheduling check with a more appropriate fault state check.
Changes:
- Replaces the `VolumeConditionTypeScheduled` check with a `VolumeRobustnessFaulted` check in the `Expand()` function
- Updates the error message to reflect the new validation logic
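As a rough sketch of the new guard (simplified stand-in types for illustration; the real check lives in longhorn-manager's `Expand()` path), expansion is now rejected only when the volume is faulted, rather than whenever it is unscheduled:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for longhorn-manager types; the real
// definitions live in the Longhorn API packages.
type VolumeRobustness string

const VolumeRobustnessFaulted VolumeRobustness = "faulted"

type Volume struct {
	Name       string
	Robustness VolumeRobustness
}

// checkVolumeExpandable mirrors the new guard: reject expansion only for a
// faulted volume, instead of for any unscheduled one.
func checkVolumeExpandable(v *Volume) error {
	if v.Robustness == VolumeRobustnessFaulted {
		return errors.New("volume " + v.Name + " is faulted and cannot be expanded")
	}
	return nil
}

func main() {
	healthy := &Volume{Name: "vol-1", Robustness: "healthy"}
	faulted := &Volume{Name: "vol-2", Robustness: VolumeRobustnessFaulted}
	fmt.Println(checkVolumeExpandable(healthy) == nil) // expansion allowed
	fmt.Println(checkVolumeExpandable(faulted) != nil) // expansion denied
}
```

This relaxes the old behavior, where a merely unscheduled (but otherwise usable) volume could not be expanded at all.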
shuo-wu
left a comment
As we discussed, I am worried about some race issues during the expansion. For example:
- A failed replica gets salvaged/reused, and maybe longhorn-manager is trying to start the replica process.
- An expansion request is received. Then volume_controller will update the spec for all the engine and replicas.
- The engine process will start to handle the expansion, but the reused replica from step 1 may not be in the engine's replica mode map yet ==> this replica may miss the expansion.
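The missed-expansion race described above can be sketched with a toy model (all names are illustrative, not Longhorn's actual engine API): the engine expands only the replicas present in its mode map, so a replica salvaged concurrently, before it is registered, keeps its old size.

```go
package main

import "fmt"

// expandReplicas models the engine applying the new size only to replicas
// it currently knows about (those registered in its replica-mode map).
func expandReplicas(modeMap map[string]string, sizes map[string]int64, newSize int64) {
	for r := range modeMap {
		sizes[r] = newSize
	}
}

func main() {
	sizes := map[string]int64{"r-1": 10, "r-2": 10, "r-salvaged": 10}
	// r-salvaged was reused after a failure and is not yet in the mode map
	// when the expansion request arrives.
	modeMap := map[string]string{"r-1": "RW", "r-2": "RW"}
	expandReplicas(modeMap, sizes, 20)
	fmt.Println(sizes["r-1"])        // 20: expanded
	fmt.Println(sizes["r-salvaged"]) // 10: missed the expansion
}
```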
After the meeting discussion, two cases remain a concern:
Another method: simply block the volume expansion in the control plane when the volume is not healthy, to avoid any race condition.

Update: it works correctly for triggering rebuilding during expansion, but it might simply not be hitting the race condition.

Code tracing:
```json
"raid": {
    "strip_size_kb": 0,
    "state": "online",
    "raid_level": "raid1",
    "num_base_bdevs": 2,
    "num_base_bdevs_discovered": 2,
    "num_base_bdevs_operational": 2,
    "base_bdevs_list": [
        {
            "name": "v1-r-22eb2536n1",
            "uuid": "ce11f3d7-eac8-5fba-96c6-ab709a5c9548",
            "is_configured": true,
            "data_offset": 0,
            "data_size": 4194304
        },
        {
            "name": "disk-1/small-lvol",
            "uuid": "ca95ea36-3be0-498e-a072-a0ffe46d2b85",
            "is_configured": true,
            "data_offset": 0,
            "data_size": 2097152
        }
    ],
    "superblock": false
}
```

It might be a rare race condition, but the risk of adding a smaller replica is real. We should deny the smaller replica in v2.

Update after SPDK code trace, in SPDK RAID:
So, we use
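A minimal sketch of the proposed v2 guard, assuming we can compare a candidate base bdev's `data_size` against the existing members' size before adding it to the RAID (the function name and signature are hypothetical, not SPDK's or Longhorn's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// validateBaseBdevSize rejects any candidate base bdev whose data_size is
// smaller than the RAID's existing members', so a stale (pre-expansion)
// replica cannot be added and shrink the array.
func validateBaseBdevSize(existingDataSize, candidateDataSize uint64) error {
	if candidateDataSize < existingDataSize {
		return errors.New("candidate base bdev is smaller than the RAID data size")
	}
	return nil
}

func main() {
	// Sizes taken from the bdev_raid dump above (in blocks).
	fmt.Println(validateBaseBdevSize(4194304, 2097152)) // the small lvol is denied
	fmt.Println(validateBaseBdevSize(4194304, 4194304)) // an equal-size replica passes
}
```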
…pand() check Signed-off-by: David Cheng <david.cheng@suse.com>
Which issue(s) this PR fixes:
longhorn/longhorn#12606
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context