fix: OnDemand Rack maintenance mode #803

Draft
vinodchitraliNVIDIA wants to merge 1 commit into NVIDIA:main from vinodchitraliNVIDIA:vc/ondemand

@vinodchitraliNVIDIA

Description

On-demand maintenance allows an operator to trigger a maintenance cycle on a rack that is in the Ready or Error state. It supports both full-rack and partial-rack scoping: the caller can optionally specify which machines, switches, or power shelves to maintain.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@vinodchitraliNVIDIA vinodchitraliNVIDIA requested a review from a team as a code owner April 3, 2026 15:17
@vinodchitraliNVIDIA vinodchitraliNVIDIA marked this pull request as draft April 3, 2026 15:18
@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@vinodchitraliNVIDIA

This is a draft version: a quick POC to check feasibility and to get early feedback.

@github-actions

github-actions bot commented Apr 3, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-03 15:19:53 UTC | Commit: 9d6a92d

On-demand maintenance allows an operator to trigger a maintenance cycle on a rack
that is in the **Ready** or **Error** state. It supports both **full-rack**
and **partial-rack** scoping — the caller can optionally specify
which machines, switches, or power shelves to maintain.

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>

On-demand maintenance allows an operator to trigger a maintenance cycle on a rack that is in the **Ready** or **Error** state. It supports both **full-rack** and **partial-rack** scoping — the caller can optionally specify which machines, switches, or power shelves to maintain.
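
The eligibility rule and the two scoping modes described above could be modelled roughly as follows. This is only a sketch: the type and function names (`RackState`, `MaintenanceScope`, `can_request_maintenance`) are hypothetical, the `Maintenance` state is a placeholder for any state other than Ready or Error, and treating an empty scope as "full rack" is an assumption rather than something this PR confirms.

```rust
use std::collections::BTreeSet;

/// Rack states. Only `Ready` and `Error` are named in the description;
/// `Maintenance` is a placeholder for "any other state" (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RackState {
    Ready,
    Error,
    Maintenance,
}

/// Hypothetical scope type. Assumption: leaving all three sets empty
/// means the whole rack is maintained.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct MaintenanceScope {
    pub machine_ids: BTreeSet<String>,
    pub switch_ids: BTreeSet<String>,
    pub power_shelf_ids: BTreeSet<String>,
}

impl MaintenanceScope {
    /// True when no specific device was listed, i.e. full-rack scope.
    pub fn is_full_rack(&self) -> bool {
        self.machine_ids.is_empty()
            && self.switch_ids.is_empty()
            && self.power_shelf_ids.is_empty()
    }
}

/// On-demand maintenance may only be triggered from Ready or Error.
pub fn can_request_maintenance(state: RackState) -> bool {
    matches!(state, RackState::Ready | RackState::Error)
}
```

A partial-rack request would populate one or more of the sets, for example by inserting a machine id into `machine_ids`.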

## Scope: Full Rack vs Partial Rack

So far the scope defines "what is affected", but not "what needs to be done".

I think we should either enhance the maintenance scope to let the admin provide a list of maintenance actions (e.g. "update firmware", "fix nmx-c", etc.) or have a different state for each of these maintenance types.
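
To make the first option concrete, here is a rough sketch of a scope that carries a list of actions next to the affected devices. Everything in it is hypothetical (`MaintenanceAction`, `ScopedMaintenance`, `ScopeError`, and the validation rules); the two variants simply mirror the examples named in the comment.

```rust
use std::collections::HashSet;

/// Hypothetical maintenance actions, mirroring the examples
/// "update firmware" and "fix nmx-c".
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MaintenanceAction {
    UpdateFirmware,
    FixNmxC,
}

/// Hypothetical request shape: what is affected plus what needs to be done.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScopedMaintenance {
    /// Device ids to maintain (machines, switches, power shelves).
    pub targets: Vec<String>,
    /// Actions to perform on the targets.
    pub actions: Vec<MaintenanceAction>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ScopeError {
    NoActions,
    DuplicateAction(MaintenanceAction),
}

impl ScopedMaintenance {
    /// Assumed rules: at least one action, and no action listed twice.
    pub fn validate(&self) -> Result<(), ScopeError> {
        if self.actions.is_empty() {
            return Err(ScopeError::NoActions);
        }
        let mut seen = HashSet::new();
        for action in &self.actions {
            if !seen.insert(*action) {
                return Err(ScopeError::DuplicateAction(*action));
            }
        }
        Ok(())
    }
}
```

The alternative, separate states per maintenance type, would move this information into the rack state machine instead of the request.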

**Messages**:

```protobuf
message RackMaintenanceOnDemandRequest {
```

In case of the update action, the admin would also need to be able to reference the software packages to be installed in some way


## RBAC

The `OnDemandRackMaintenance` permission is granted to the `ForgeAdminCLI` role.

Probably also for the REST API. Sooner or later it will need to be supported there.

updated_config.maintenance_requested = Some(scope);

let mut txn = api.txn_begin().await?;
db_rack::update(&mut txn, &rack_id, &updated_config).await?;

This races with the check for rack.config.maintenance_requested: two executions could hit this path at the same time.

To prevent this, we can do any of the following:

  1. Fetch the row with FOR UPDATE.
  2. Guard the write with WHERE maintenance_requested IS NULL.
  3. Guard the write on the last observed version field, and increment the version in the scope of the request.

I'd likely go for 3) and emit a ConcurrentModificationError if we detect a conflict, mostly because I think we'd want to update the rack version in the scope of the request anyway.
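
As an illustration of option 3, the sketch below models the version guard with an in-memory map instead of the real database. All names (`RackStore`, `RackRow`, `UpdateError`, `request_maintenance`) are hypothetical and are not this repository's API. The method stands in for a conditional write of the form `UPDATE ... SET ..., version = version + 1 WHERE id = $1 AND version = $2`, where zero affected rows would be reported as a concurrent modification.

```rust
use std::collections::HashMap;

/// Hypothetical row: a version counter plus the pending maintenance scope.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RackRow {
    pub version: u64,
    pub maintenance_requested: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum UpdateError {
    NotFound,
    AlreadyRequested,
    ConcurrentModification { expected: u64, found: u64 },
}

/// In-memory stand-in for the rack table (assumption, for illustration only).
#[derive(Default)]
pub struct RackStore {
    rows: HashMap<String, RackRow>,
}

impl RackStore {
    /// Adds a rack at version 1 with no maintenance requested.
    pub fn insert(&mut self, id: &str) {
        let row = RackRow { version: 1, maintenance_requested: None };
        self.rows.insert(id.to_string(), row);
    }

    /// Returns a snapshot of the row, as a reader would observe it.
    pub fn get(&self, id: &str) -> Option<RackRow> {
        self.rows.get(id).cloned()
    }

    /// Writes only if the stored version still equals the version the caller
    /// observed, then bumps the version. Returns the new version.
    pub fn request_maintenance(
        &mut self,
        id: &str,
        observed_version: u64,
        scope: &str,
    ) -> Result<u64, UpdateError> {
        let row = self.rows.get_mut(id).ok_or(UpdateError::NotFound)?;
        if row.version != observed_version {
            return Err(UpdateError::ConcurrentModification {
                expected: observed_version,
                found: row.version,
            });
        }
        if row.maintenance_requested.is_some() {
            return Err(UpdateError::AlreadyRequested);
        }
        row.maintenance_requested = Some(scope.to_string());
        row.version += 1;
        Ok(row.version)
    }
}
```

If two callers both read the same version, the first write succeeds and bumps the version, so the second write fails the guard instead of silently overwriting the first request.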
