chore: move rack-id from cfg into states, improve on tests#740
chore: move rack-id from cfg into states, improve on tests#740maxo-nv wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
53a14e3 to
685ddbd
Compare
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-03-30 08:01:16 UTC | Commit: 685ddbd |
There was a problem hiding this comment.
Pull request overview
This PR updates the rack validation flow to correlate state-machine transitions with live RVS machine metadata by moving run_id tracking into RackValidationState, and refactors rack lifecycle integration tests to exercise the real RackStateHandler against a DB-backed environment.
Changes:
- Move validation
run_idfromRackConfigto non-PendingRackValidationStatevariants; deriverun_idfrom observedrv.run-idmachine labels. - Add RV label lifecycle helpers: scan compute trays for
rv.run-idto start validation, and clearrv.*labels onMaintenance(Completed)to avoid stale metadata. - Refactor/replace rack controller integration tests to use
RackStateHandlerand extendTestRackDbBuilderto populate compute trays/power shelves.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/api-model/src/rack.rs | Moves run_id into validation substates; removes validation_run_id from RackConfig. |
| crates/api/src/state_controller/rack/handler.rs | Adds find_rv_run_id + clear_rv_labels; wires validation gating/filtering to run_id carried in substates. |
| crates/api/src/state_controller/rack/rv.rs | Adds strip_rv_labels; makes partition aggregation require a run-id filter. |
| crates/api/src/state_controller/rack/io.rs | Updates state-to-metric mapping for new RackValidationState shapes. |
| crates/api-db/src/rack.rs | Updates rack creation defaults to match updated RackConfig. |
| crates/api/src/tests/common/api_fixtures/site_explorer.rs | Extends TestRackDbBuilder to populate compute trays and power shelves in RackConfig. |
| crates/api/src/tests/rack_state_controller/mod.rs | Replaces stub-driven lifecycle test with real-handler-driven lifecycle test; updates error-state test. |
| crates/api/src/tests/rack_state_controller/handler.rs | Exposes config/id helpers for reuse; removes validation_run_id from test configs. |
| crates/api/src/tests/rack_health.rs | Removes validation_run_id from rack config setup in health tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "{\"state\": \"validation\", \"rack_validation\": {\"Validated\": {\"run_id\": \"test-run\"}}}", | ||
| "{\"state\": \"ready\"}", |
There was a problem hiding this comment.
This comment is now incorrect: in this test the run_id comes from the real RackStateHandler observing rv.run-id machine labels, not from TestRackStateHandler. Please update/remove the comment to avoid confusion when reading the test expectations.
There was a problem hiding this comment.
maybe I'm blind, but I don't see what are you referring to
|
Thanks for adding such a long description! Based on that, there might be some conflicts with #736 which also touches rack states and improves the state of the unit-tests. We will need to land them one by one. I'll look at the code in detail separately. |
| rack: &Rack, | ||
| ctx: &mut StateHandlerContext<'_, RackStateHandlerContextObjects>, | ||
| ) -> Result<(), StateHandlerError> { | ||
| let machine_ids = &rack.config.compute_trays; |
There was a problem hiding this comment.
It's probably better to use db::machine::find_machine_ids where the filter (search_config) carries the rack IDs. That gives you just the ingested racks. The config thing will go away.
same applies for the other functions here
There was a problem hiding this comment.
Done,
/me clearly misunderstands the notion of "injested rack"
| .labels | ||
| .keys() | ||
| .filter(|k| k.starts_with("rv.")) | ||
| .cloned() |
There was a problem hiding this comment.
you could do a Vec<&String> or Vec<&str> and avoid the cloned(). Just a bit more efficient since it doesn't require the allocations
There was a problem hiding this comment.
done: simplified even further
| // machine in compute_trays and requires ManagedHostState::Ready. We insert | ||
| // machine records directly, bypassing the full machine state machine since | ||
| // this test exercises the rack SM only. | ||
| db_machine::create( |
There was a problem hiding this comment.
we should use create_managed_host with a config that ties it to the rack. That assures that the machine actually goes through the right and valid lifecycle states.
There was a problem hiding this comment.
at first I thought you referred to MachineCreator::create_managed_host() and it sent me down a rabbit hole...
Now I'm realizing that you were talking about TestEnv::create_managed_host right?
In which case it looks like another rabbit hole for me, albeit considerably smaller. In that hole, at least around the entrance, I see that I need to wire up the rack SM controller similar to what is already done for other controllers:
struct TestEnv {
// ...
machine_state_controller: Arc<Mutex<StateController<MachineStateControllerIO>>>,
spdm_state_controller: Arc<Mutex<StateController<SpdmStateControllerIO>>>,
pub machine_state_handler: SwapHandler<MachineStateHandler>,
network_segment_controller: Arc<Mutex<StateController<NetworkSegmentStateControllerIO>>>,
ib_partition_controller: Arc<Mutex<StateController<IBPartitionStateControllerIO>>>,
power_shelf_controller: Arc<Mutex<StateController<PowerShelfStateControllerIO>>>,
switch_controller: Arc<Mutex<StateController<SwitchStateControllerIO>>>,
// ...
}but... am I on the right path? is using TestEnv form of Rack SM testing we need? What would be a nice example of using it? I'm confused what should I copy :)
There was a problem hiding this comment.
You shouldn't need to touch any of TestEnv, I think Matthias is just saying you can do something like:
let host_1 = create_managed_host_with_config(
&env,
ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
)
.await
.host()
.db_machine(&mut txn)
.await;
let host_2 = create_managed_host_with_config(
&env,
ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
)
.await
.host()
.db_machine(&mut txn)
.await;
Which will create each machine and give you back the Machine entry from the database. Then you can create the rack with the MAC addresses from the hosts as expected_compute_trays, and their id's as the with_compute_trays, ie.
TestRackDbBuilder::new()
.with_rack_id(rack_id.clone())
.with_expected_compute_trays(vec![
host_1.bmc_info.mac.unwrap().bytes(),
host_2.bmc_info.mac.unwrap().bytes(),
])
.with_expected_power_shelves(vec![])
.with_expected_switches(vec![])
.with_compute_trays(vec![host_1.id, host_2.id])
.with_rack_type("Simple")
.persist(&mut txn)
.await?;
There was a problem hiding this comment.
You are both right :)
TestEnv at the moment does not have a reference to rack_state_controller and thereby we can't really do rack_state_controller.run_single_iteration() and anything which depends on it. So yes, it should be added.
But then there's also the question on how the Machine is created, and whether it automatically creates the Rack on the way. And Ken's first code snippet would do that - but it would require the expected-rack to be also stored first.
I think @vinodchitraliNVIDIA 's change has the code to automaticall create a rack object if its referenced by a Machine and referenced in expected-machines. So maybe we should just wait until that is in and then rework all of these tests.
You're welcome! I guess it's just my mediocre creativity trying to break through commit messages, hence all this verbosity.
Agree. I'm happy to yield the way to #736 and fix merge conflicts on my side. |
| } => { | ||
| match rack_firmware_upgrade { | ||
| RackFirmwareUpgradeState::Compute => { | ||
| // TODO[#416]: Implement compute firmware upgrade |
There was a problem hiding this comment.
Why drop these issue numbers? I actually think it's a great idea for TODO comments to always correspond to an issue number. (In a past life we had a lint that ensured "no TODO comments without a bug ID" because otherwise TODO comments just languish and never get addressed.)
There was a problem hiding this comment.
these parts are all tracked in Jira, and we don't have a corresponding GH issue numbers. let me talk to @vinodchitraliNVIDIA about that
| self | ||
| } | ||
|
|
||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Please delete this function (or comment it out if you want to leave a hint for future work), or if you want, just add a .with_power_shelves(vec![]) in one of the tests so it's not dead code.
There was a problem hiding this comment.
crap, this fn is actually used, my bad I left dead_code there
| // machine in compute_trays and requires ManagedHostState::Ready. We insert | ||
| // machine records directly, bypassing the full machine state machine since | ||
| // this test exercises the rack SM only. | ||
| db_machine::create( |
There was a problem hiding this comment.
You shouldn't need to touch any of TestEnv, I think Matthias is just saying you can do something like:
let host_1 = create_managed_host_with_config(
&env,
ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
)
.await
.host()
.db_machine(&mut txn)
.await;
let host_2 = create_managed_host_with_config(
&env,
ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
)
.await
.host()
.db_machine(&mut txn)
.await;
Which will create each machine and give you back the Machine entry from the database. Then you can create the rack with the MAC addresses from the hosts as expected_compute_trays, and their id's as the with_compute_trays, ie.
TestRackDbBuilder::new()
.with_rack_id(rack_id.clone())
.with_expected_compute_trays(vec![
host_1.bmc_info.mac.unwrap().bytes(),
host_2.bmc_info.mac.unwrap().bytes(),
])
.with_expected_power_shelves(vec![])
.with_expected_switches(vec![])
.with_compute_trays(vec![host_1.id, host_2.id])
.with_rack_type("Simple")
.persist(&mut txn)
.await?;
Matthias247
left a comment
There was a problem hiding this comment.
Unit-tests should get some updates, but we can coordinate with Vinods changes.
|
@maxo-nv all the conflicting branches have been merged. Can you rebase on top of them? |
Follow-up to the rack Validation state machine (see NVIDIA#416). This patch wires the Validation SM transitions to live RVS machine metadata and cleans up the run_id plumbing. As a nice bonus: a few integration tests are refactored to use actual `RackHandler` instead of test handler stubs, inspired by @Matthias247 and @chet work done in `tests/rack_state_controller/handler.rs` In particular, `TestRackStateHandler` was a stub that hard-coded transitions (`Pending -> InProgress -> Partial -> Validated -> Ready`) to exercise the controller loop. Per Mattias's note on NVIDIA#402 (comment), the right approach is to test real handlers against a mock DB. The deleted tests would have continued passing even if the actual handler logic was completely broken. Changes: - Move `run_id` from `RackConfig` into `RackValidationState` - all non-`Pending` variants (`InProgress`, `Partial`, `FailedPartial`, `Validated`, `Failed`) carry `run_id`. Removed `validation_run_id` from `RackConfig`. - Add `find_rv_run_id`: scans compute tray machines for the `rv.run-id` label set by RVS, driving `Pending -> InProgress`. - Add `clear_rv_labels` / `strip_rv_labels`: strip all `rv.*` labels from compute tray metadata in one transaction on `Maintenance(Completed)`, so each RVS cycle starts with a clean slate. - Extend `TestRackDbBuilder` with `with_compute_trays` / `with_power_shelves`, resolving the existing TODO and removing the need for a manual `db_rack::update` call from tests. - Extend the rack lifecycle integration test to walk through `Validation(Pending -> InProgress -> Partial -> Validated) -> Ready`. - Deleted `test_rack_state_transition_validation`: the test manually defined every possible `RackState` variant to the DB and read it back. Effectively a DB serialization round-trip test with no state machine involvement. - Replaced `test_can_retrieve_rack_state_history` with the new `test_can_retrieve_rack_state_history_with_real_handler`: it uses `RackStateHandler`, drives the rack through the full lifecycle via actual DB state, and asserts on the resulting state at every step. - Deleted `test_rack_error_state_handling`: asserted `count > 0` on the fake handler rather than verifying the rack stayed in `Error`. The real behavior is now tested in `test_error_state_does_nothing_with_controller` using `RackStateHandler` directly. - Deleted `test_rack_state_transitions`: the test verified only that the controller infrastructure called `TestRackStateHandler` at least once per rack ID. No assertions about which state transitions occurred or whether the handler logic was correct. Signed-off-by: Max Olender <molender@nvidia.com>
3facbc3 to
86113ef
Compare
Follow-up to the rack Validation state machine (see #416). This patch wires the Validation SM transitions to live RVS machine metadata and cleans up the run_id plumbing. As a nice bonus: a few integration tests are refactored to use actual
RackHandlerinstead of test handler stubs, inspired by @Matthias247 and @chet work done intests/rack_state_controller/handler.rsIn particular,
TestRackStateHandlerwas a stub that hard-coded transitions (Pending -> InProgress -> Partial -> Validated -> Ready) to exercise the controller loop. Per Mattias's note on #402 (comment), the right approach is to test real handlers against a mock DB. The deleted tests would have continued passing even if the actual handler logic was completely broken.Changes:
Move
run_idfromRackConfigintoRackValidationState- all non-Pendingvariants (InProgress,Partial,FailedPartial,Validated,Failed) carryrun_id. Removedvalidation_run_idfromRackConfig.Add
find_rv_run_id: scans compute tray machines for therv.run-idlabel set by RVS, drivingPending -> InProgress.Add
clear_rv_labels/strip_rv_labels: strip allrv.*labels from compute tray metadata in one transaction onMaintenance(Completed), so each RVS cycle starts with a clean slate.Extend
TestRackDbBuilderwithwith_compute_trays/with_power_shelves, resolving the existing TODO and removing the need for a manualdb_rack::updatecall from tests.Extend the rack lifecycle integration test to walk through
Validation(Pending -> InProgress -> Partial -> Validated) -> Ready.Deleted
test_rack_state_transition_validation: the test manually defined every possibleRackStatevariant to the DB and read it back. Effectively a DB serialization round-trip test with no state machine involvement.Replaced
test_can_retrieve_rack_state_historywith the newtest_can_retrieve_rack_state_history_with_real_handler: it usesRackStateHandler, drives the rack through the full lifecycle via actual DB state, and asserts on the resulting state at every step.Deleted
test_rack_error_state_handling: assertedcount > 0on the fake handler rather than verifying the rack stayed inError. The real behavior is now tested intest_error_state_does_nothing_with_controllerusingRackStateHandlerdirectly.Deleted
test_rack_state_transitions: the test verified only that the controller infrastructure calledTestRackStateHandlerat least once per rack ID. No assertions about which state transitions occurred or whether the handler logic was correct.Description
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes