chore: move rack-id from cfg into states, improve on tests#740

Open
maxo-nv wants to merge 1 commit intoNVIDIA:mainfrom
maxo-nv:feature/better_rack_integration_tests

Conversation

@maxo-nv (Contributor) commented Mar 30, 2026

Follow-up to the rack Validation state machine (see #416). This patch wires the Validation SM transitions to live RVS machine metadata and cleans up the run_id plumbing. As a nice bonus, a few integration tests are refactored to use the actual RackStateHandler instead of test handler stubs, inspired by @Matthias247's and @chet's work in tests/rack_state_controller/handler.rs

In particular, TestRackStateHandler was a stub that hard-coded transitions (Pending -> InProgress -> Partial -> Validated -> Ready) to exercise the controller loop. Per Matthias's note on #402 (comment), the right approach is to test real handlers against a mock DB. The deleted tests would have continued passing even if the actual handler logic was completely broken.

Changes:

  • Move run_id from RackConfig into RackValidationState - all non-Pending variants (InProgress, Partial, FailedPartial, Validated, Failed) carry run_id. Removed validation_run_id from RackConfig.

  • Add find_rv_run_id: scans compute tray machines for the rv.run-id label set by RVS, driving Pending -> InProgress.

  • Add clear_rv_labels / strip_rv_labels: strip all rv.* labels from compute tray metadata in one transaction on Maintenance(Completed), so each RVS cycle starts with a clean slate.

  • Extend TestRackDbBuilder with with_compute_trays / with_power_shelves, resolving the existing TODO and removing the need for a manual db_rack::update call from tests.

  • Extend the rack lifecycle integration test to walk through Validation(Pending -> InProgress -> Partial -> Validated) -> Ready.

  • Deleted test_rack_state_transition_validation: the test manually wrote every possible RackState variant to the DB and read it back, making it effectively a DB serialization round-trip test with no state-machine involvement.

  • Replaced test_can_retrieve_rack_state_history with the new test_can_retrieve_rack_state_history_with_real_handler: it uses RackStateHandler, drives the rack through the full lifecycle via actual DB state, and asserts on the resulting state at every step.

  • Deleted test_rack_error_state_handling: asserted count > 0 on the fake handler rather than verifying the rack stayed in Error. The real behavior is now tested in test_error_state_does_nothing_with_controller using RackStateHandler directly.

  • Deleted test_rack_state_transitions: the test verified only that the controller infrastructure called TestRackStateHandler at least once per rack ID. No assertions about which state transitions occurred or whether the handler logic was correct.
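The two label helpers described in the list above can be sketched roughly as follows. This is a hedged illustration only: the Machine struct here is a minimal stand-in for the real api-model type, and the signatures are simplified (the actual handlers operate on DB records inside a transaction).

```rust
use std::collections::HashMap;

// Minimal stand-in for a machine record (hypothetical shape; the real
// type carries far more than labels).
struct Machine {
    labels: HashMap<String, String>,
}

// Scan compute-tray machines for the `rv.run-id` label set by RVS;
// the first hit is what drives Pending -> InProgress.
fn find_rv_run_id(machines: &[Machine]) -> Option<String> {
    machines
        .iter()
        .find_map(|m| m.labels.get("rv.run-id").cloned())
}

// Strip all `rv.*` labels so the next RVS cycle starts with a clean slate.
fn strip_rv_labels(machine: &mut Machine) {
    machine.labels.retain(|k, _| !k.starts_with("rv."));
}

fn main() {
    let mut tray = Machine {
        labels: HashMap::from([
            ("rv.run-id".to_string(), "test-run".to_string()),
            ("rv.phase".to_string(), "partial".to_string()),
            ("site".to_string(), "a1".to_string()),
        ]),
    };
    assert_eq!(
        find_rv_run_id(std::slice::from_ref(&tray)),
        Some("test-run".to_string())
    );
    strip_rv_labels(&mut tray);
    assert_eq!(tray.labels.len(), 1);
    assert!(tray.labels.contains_key("site"));
    println!("ok");
}
```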


Copilot AI review requested due to automatic review settings March 30, 2026 07:59
@maxo-nv maxo-nv force-pushed the feature/better_rack_integration_tests branch from 53a14e3 to 685ddbd Compare March 30, 2026 07:59
@maxo-nv maxo-nv requested a review from a team as a code owner March 30, 2026 07:59
@github-actions

🔐 TruffleHog Secret Scan: no secrets or credentials found (commit 685ddbd, 2026-03-30 08:01:16 UTC).

Copilot AI left a comment:
Pull request overview

This PR updates the rack validation flow to correlate state-machine transitions with live RVS machine metadata by moving run_id tracking into RackValidationState, and refactors rack lifecycle integration tests to exercise the real RackStateHandler against a DB-backed environment.

Changes:

  • Move validation run_id from RackConfig to non-Pending RackValidationState variants; derive run_id from observed rv.run-id machine labels.
  • Add RV label lifecycle helpers: scan compute trays for rv.run-id to start validation, and clear rv.* labels on Maintenance(Completed) to avoid stale metadata.
  • Refactor/replace rack controller integration tests to use RackStateHandler and extend TestRackDbBuilder to populate compute trays/power shelves.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Summary per file:

  • crates/api-model/src/rack.rs: moves run_id into validation substates; removes validation_run_id from RackConfig.
  • crates/api/src/state_controller/rack/handler.rs: adds find_rv_run_id + clear_rv_labels; wires validation gating/filtering to the run_id carried in substates.
  • crates/api/src/state_controller/rack/rv.rs: adds strip_rv_labels; makes partition aggregation require a run-id filter.
  • crates/api/src/state_controller/rack/io.rs: updates state-to-metric mapping for the new RackValidationState shapes.
  • crates/api-db/src/rack.rs: updates rack creation defaults to match the updated RackConfig.
  • crates/api/src/tests/common/api_fixtures/site_explorer.rs: extends TestRackDbBuilder to populate compute trays and power shelves in RackConfig.
  • crates/api/src/tests/rack_state_controller/mod.rs: replaces the stub-driven lifecycle test with a real-handler-driven lifecycle test; updates the error-state test.
  • crates/api/src/tests/rack_state_controller/handler.rs: exposes config/id helpers for reuse; removes validation_run_id from test configs.
  • crates/api/src/tests/rack_health.rs: removes validation_run_id from rack config setup in health tests.


Comment on lines +401 to +402
"{\"state\": \"validation\", \"rack_validation\": {\"Validated\": {\"run_id\": \"test-run\"}}}",
"{\"state\": \"ready\"}",
Copilot AI, Mar 30, 2026:
This comment is now incorrect: in this test the run_id comes from the real RackStateHandler observing rv.run-id machine labels, not from TestRackStateHandler. Please update/remove the comment to avoid confusion when reading the test expectations.

maxo-nv (Contributor, Author) replied:
maybe I'm blind, but I don't see what you are referring to

@Matthias247 (Contributor):

Thanks for adding such a long description! Based on that, there might be some conflicts with #736 which also touches rack states and improves the state of the unit-tests. We will need to land them one by one. I'll look at the code in detail separately.

rack: &Rack,
ctx: &mut StateHandlerContext<'_, RackStateHandlerContextObjects>,
) -> Result<(), StateHandlerError> {
let machine_ids = &rack.config.compute_trays;
Contributor:
It's probably better to use db::machine::find_machine_ids where the filter (search_config) carries the rack IDs. That gives you just the ingested racks. The config thing will go away.

same applies for the other functions here

maxo-nv (Contributor, Author) replied:
Done,

/me clearly misunderstands the notion of "ingested rack"

.labels
.keys()
.filter(|k| k.starts_with("rv."))
.cloned()
Contributor:
you could collect a Vec<&String> or Vec<&str> and avoid the .cloned(). It's a bit more efficient since it doesn't require the allocations

maxo-nv (Contributor, Author) replied:
done: simplified even further
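For readers skimming the thread, the allocation difference being discussed looks roughly like this. It's a standalone sketch over a local map (the real code filters machine metadata labels); the helper names here are illustrative only:

```rust
use std::collections::HashMap;

// Owning version: .cloned() allocates a fresh String per matching key.
fn rv_keys_owned(labels: &HashMap<String, String>) -> Vec<String> {
    labels
        .keys()
        .filter(|k| k.starts_with("rv."))
        .cloned()
        .collect()
}

// Borrowing version: collects &str views into the map, avoiding the
// per-key allocation; the result is tied to the borrow of `labels`.
fn rv_keys_borrowed(labels: &HashMap<String, String>) -> Vec<&str> {
    labels
        .keys()
        .filter(|k| k.starts_with("rv."))
        .map(String::as_str)
        .collect()
}

fn main() {
    let labels = HashMap::from([
        ("rv.run-id".to_string(), "r1".to_string()),
        ("os".to_string(), "linux".to_string()),
    ]);
    assert_eq!(rv_keys_owned(&labels), vec!["rv.run-id".to_string()]);
    assert_eq!(rv_keys_borrowed(&labels), vec!["rv.run-id"]);
    println!("ok");
}
```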

// machine in compute_trays and requires ManagedHostState::Ready. We insert
// machine records directly, bypassing the full machine state machine since
// this test exercises the rack SM only.
db_machine::create(
Contributor:
we should use create_managed_host with a config that ties it to the rack. That ensures that the machine actually goes through the right and valid lifecycle states.

maxo-nv (Contributor, Author) replied:
at first I thought you referred to MachineCreator::create_managed_host() and it sent me down a rabbit hole...

Now I'm realizing that you were talking about TestEnv::create_managed_host right?

In which case it looks like another rabbit hole for me, albeit considerably smaller. In that hole, at least around the entrance, I see that I need to wire up the rack SM controller similar to what is already done for other controllers:

struct TestEnv {
// ...
    machine_state_controller: Arc<Mutex<StateController<MachineStateControllerIO>>>,
    spdm_state_controller: Arc<Mutex<StateController<SpdmStateControllerIO>>>,
    pub machine_state_handler: SwapHandler<MachineStateHandler>,
    network_segment_controller: Arc<Mutex<StateController<NetworkSegmentStateControllerIO>>>,
    ib_partition_controller: Arc<Mutex<StateController<IBPartitionStateControllerIO>>>,
    power_shelf_controller: Arc<Mutex<StateController<PowerShelfStateControllerIO>>>,
    switch_controller: Arc<Mutex<StateController<SwitchStateControllerIO>>>,
// ...
}

but... am I on the right path? Is using TestEnv the form of Rack SM testing we need? What would be a nice example of using it? I'm confused about what I should copy :)

Contributor:
You shouldn't need to touch any of TestEnv, I think Matthias is just saying you can do something like:

    let host_1 = create_managed_host_with_config(
        &env,
        ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
    )
    .await
    .host()
    .db_machine(&mut txn)
    .await;
    let host_2 = create_managed_host_with_config(
        &env,
        ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
    )
    .await
    .host()
    .db_machine(&mut txn)
    .await;

Which will create each machine and give you back the Machine entry from the database. Then you can create the rack with the MAC addresses from the hosts as expected_compute_trays, and their ids as with_compute_trays, i.e.

TestRackDbBuilder::new()
    .with_rack_id(rack_id.clone())
    .with_expected_compute_trays(vec![
        host_1.bmc_info.mac.unwrap().bytes(),
        host_2.bmc_info.mac.unwrap().bytes(),
    ])
    .with_expected_power_shelves(vec![])
    .with_expected_switches(vec![])
    .with_compute_trays(vec![host_1.id, host_2.id])
    .with_rack_type("Simple")
    .persist(&mut txn)
    .await?;

Contributor:
You are both right :)

TestEnv at the moment does not have a reference to rack_state_controller, so we can't really do rack_state_controller.run_single_iteration() or anything which depends on it. So yes, it should be added.

But then there's also the question on how the Machine is created, and whether it automatically creates the Rack on the way. And Ken's first code snippet would do that - but it would require the expected-rack to be also stored first.

I think @vinodchitraliNVIDIA's change has the code to automatically create a rack object if it's referenced by a Machine and referenced in expected-machines. So maybe we should just wait until that is in and then rework all of these tests.

@maxo-nv (Contributor, Author) commented Apr 1, 2026

Thanks for adding such a long description!

You're welcome! I guess it's just my mediocre creativity trying to break through commit messages, hence all this verbosity.

Based on that, there might be some conflicts with #736 which also touches rack states and improves the state of the unit-tests. We will need to land them one by one. I'll look at the code in detail separately.

Agree. I'm happy to yield the way to #736 and fix merge conflicts on my side.

} => {
match rack_firmware_upgrade {
RackFirmwareUpgradeState::Compute => {
// TODO[#416]: Implement compute firmware upgrade
Contributor:
Why drop these issue numbers? I actually think it's a great idea for TODO comments to always correspond to an issue number. (In a past life we had a lint that ensured "no TODO comments without a bug ID" because otherwise TODO comments just languish and never get addressed.)
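A lint like the one described above can be approximated with a grep pipeline. This is a hypothetical sketch (the TODO[#123] pattern simply mirrors the convention visible in this diff), and the demo runs on an inline sample rather than the source tree:

```shell
# Count TODO comments that lack an issue reference like TODO[#416].
# In CI you would pipe `grep -rn TODO crates/` in instead of the sample.
printf '%s\n' \
  '// TODO[#416]: Implement compute firmware upgrade' \
  '// TODO: fix this later' \
  | grep -E 'TODO' \
  | grep -vcE 'TODO\[#[0-9]+\]'
```

This prints 1 (the one unreferenced TODO); a CI job would fail whenever the count is non-zero.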

maxo-nv (Contributor, Author) replied:
these parts are all tracked in Jira, and we don't have corresponding GH issue numbers. Let me talk to @vinodchitraliNVIDIA about that

self
}

#[allow(dead_code)]
Contributor:
Please delete this function (or comment it out if you want to leave a hint for future work), or if you want, just add a .with_power_shelves(vec![]) in one of the tests so it's not dead code.

maxo-nv (Contributor, Author) replied:
crap, this fn is actually used; my bad, I left dead_code there

@Matthias247 (Contributor) left a comment:
Unit-tests should get some updates, but we can coordinate with Vinod's changes.

@Matthias247 (Contributor):

@maxo-nv all the conflicting branches have been merged. Can you rebase on top of them?

@maxo-nv maxo-nv force-pushed the feature/better_rack_integration_tests branch from 3facbc3 to 86113ef Compare April 7, 2026 07:28