fix: OnDemand Rack maintenance mode #803
vinodchitraliNVIDIA wants to merge 1 commit into NVIDIA:main
This is a draft version: a quick POC to check feasibility and to get early feedback.
TruffleHog Secret Scan: no secrets or credentials found. Last updated: 2026-04-03 15:19:53 UTC | Commit: 9d6a92d
9d6a92d to 863294e
On-demand maintenance allows an operator to trigger a maintenance cycle on a rack that is in the **Ready** or **Error** state. It supports both **full-rack** and **partial-rack** scoping: the caller can optionally specify which machines, switches, or power shelves to maintain.
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
863294e to 81642da
| On-demand maintenance allows an operator to trigger a maintenance cycle on a rack that is in the **Ready** or **Error** state. It supports both **full-rack** and **partial-rack** scoping: the caller can optionally specify which machines, switches, or power shelves to maintain.
| ## Scope: Full Rack vs Partial Rack
So far the scope defines "what is affected", but not "what needs to be done".
I think we should either enhance the maintenance scope to let the admin provide a list of maintenance actions (e.g. "update firmware", "fix nmx-c", etc.) or have different states for each of these maintenance types.
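A minimal sketch of what the first option could look like, assuming the scope keeps its optional device lists and gains an explicit action list. Every name here (`MaintenanceAction`, `MaintenanceScope`, `MaintenanceRequest`) is hypothetical and not taken from the PR; treating empty device lists as "full rack" is also an assumption.

```rust
/// Hypothetical maintenance actions; the variants mirror the examples
/// named in the comment above and are not part of the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MaintenanceAction {
    UpdateFirmware,
    FixNmxC,
}

/// Hypothetical scope ("what is affected").
/// Assumption: all lists empty means the whole rack is in scope.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct MaintenanceScope {
    pub machine_ids: Vec<String>,
    pub switch_ids: Vec<String>,
    pub power_shelf_ids: Vec<String>,
}

impl MaintenanceScope {
    /// True when no individual device was selected.
    pub fn is_full_rack(&self) -> bool {
        self.machine_ids.is_empty()
            && self.switch_ids.is_empty()
            && self.power_shelf_ids.is_empty()
    }
}

/// Hypothetical request pairing the scope with "what needs to be done".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MaintenanceRequest {
    pub scope: MaintenanceScope,
    pub actions: Vec<MaintenanceAction>,
}

impl MaintenanceRequest {
    /// Rejects requests that name no action, so a scope alone is never enough.
    pub fn new(
        scope: MaintenanceScope,
        actions: Vec<MaintenanceAction>,
    ) -> Result<Self, String> {
        if actions.is_empty() {
            return Err("at least one maintenance action is required".to_string());
        }
        Ok(Self { scope, actions })
    }
}
```

With this shape, a request such as `MaintenanceRequest::new(MaintenanceScope::default(), vec![MaintenanceAction::UpdateFirmware])` would describe a full-rack firmware update, while the alternative (separate states per maintenance type) would move the same information into the state machine instead.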
| **Messages**:
| ```protobuf
| message RackMaintenanceOnDemandRequest {
For the update action, the admin would also need some way to reference the software packages to be installed.
| ## RBAC
| The `OnDemandRackMaintenance` permission is granted to the `ForgeAdminCLI` role.
This is probably also needed for the REST API; sooner or later it will need to be supported there.
| updated_config.maintenance_requested = Some(scope);
| let mut txn = api.txn_begin().await?;
| db_rack::update(&mut txn, &rack_id, &updated_config).await?;
This races with the check for `rack.config.maintenance_requested`: two executions could hit this path at the same time.
To prevent that, we can do any of the following:
1. fetch with `FOR UPDATE`
2. guard the write with `WHERE maintenance_requested IS NULL`
3. guard the write on the last observed version field, and increment the version in the scope of the request.
I'd likely go for 3) and emit a ConcurrentModificationError if we detect it, mostly because I think we'd want to update the rack version anyway in the scope of the request.
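A rough sketch of the version-guarded write, using an in-memory map as a stand-in for the rack table so it runs on its own. `RackStore`, `RackRow`, `UpdateError`, and `request_maintenance` are hypothetical names, not the PR's `db_rack` API; in the real code the guard would presumably live in the SQL `UPDATE` (for example a `WHERE ... AND version = <observed>` clause) rather than in Rust.

```rust
use std::collections::HashMap;

/// Hypothetical error type; `ConcurrentModification` corresponds to the
/// ConcurrentModificationError mentioned above.
#[derive(Debug, PartialEq, Eq)]
pub enum UpdateError {
    NotFound,
    ConcurrentModification { expected: u64, found: u64 },
}

/// Hypothetical, simplified rack row: a version counter plus the pending request.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RackRow {
    pub version: u64,
    pub maintenance_requested: Option<String>,
}

/// In-memory stand-in for the rack table (assumption: the real store is a database).
#[derive(Default)]
pub struct RackStore {
    rows: HashMap<String, RackRow>,
}

impl RackStore {
    /// Adds a rack at version 0 with no maintenance requested.
    pub fn insert(&mut self, rack_id: &str) {
        self.rows.insert(
            rack_id.to_string(),
            RackRow { version: 0, maintenance_requested: None },
        );
    }

    /// Returns a snapshot of the row, including the version the caller observed.
    pub fn get(&self, rack_id: &str) -> Option<RackRow> {
        self.rows.get(rack_id).cloned()
    }

    /// Writes the request only if the row is still at `observed_version`,
    /// then bumps the version. A stale caller gets `ConcurrentModification`.
    pub fn request_maintenance(
        &mut self,
        rack_id: &str,
        observed_version: u64,
        scope: String,
    ) -> Result<u64, UpdateError> {
        let row = self.rows.get_mut(rack_id).ok_or(UpdateError::NotFound)?;
        if row.version != observed_version {
            return Err(UpdateError::ConcurrentModification {
                expected: observed_version,
                found: row.version,
            });
        }
        row.maintenance_requested = Some(scope);
        row.version += 1;
        Ok(row.version)
    }
}
```

If two handlers both read the rack at the same version, only the first write succeeds; the second sees a changed version and fails instead of silently overwriting the first request.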
Description
On-demand maintenance allows an operator to trigger a maintenance cycle on a rack that is in the Ready or Error state. It supports both full-rack and partial-rack scoping: the caller can optionally specify which machines, switches, or power shelves to maintain.
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes