chore: move rack-id from cfg into states, improve on tests#740

Open
maxo-nv wants to merge 1 commit intoNVIDIA:mainfrom
maxo-nv:feature/better_rack_integration_tests

Conversation

@maxo-nv (Contributor) commented Mar 30, 2026

Follow-up to the rack Validation state machine (see #416). This patch wires the Validation SM transitions to live RVS machine metadata and cleans up the run_id plumbing. As a nice bonus, a few integration tests are refactored to use the actual RackStateHandler instead of test handler stubs, inspired by @Matthias247's and @chet's work in tests/rack_state_controller/handler.rs

In particular, TestRackStateHandler was a stub that hard-coded transitions (Pending -> InProgress -> Partial -> Validated -> Ready) to exercise the controller loop. Per Matthias's note on #402 (comment), the right approach is to test real handlers against a mock DB. The deleted tests would have continued passing even if the actual handler logic was completely broken.

Changes:

  • Move run_id from RackConfig into RackValidationState - all non-Pending variants (InProgress, Partial, FailedPartial, Validated, Failed) carry run_id. Removed validation_run_id from RackConfig.

  • Add find_rv_run_id: scans compute tray machines for the rv.run-id label set by RVS, driving Pending -> InProgress.

  • Add clear_rv_labels / strip_rv_labels: strip all rv.* labels from compute tray metadata in one transaction on Maintenance(Completed), so each RVS cycle starts with a clean slate.

  • Extend TestRackDbBuilder with with_compute_trays / with_power_shelves, resolving the existing TODO and removing the need for a manual db_rack::update call from tests.

  • Extend the rack lifecycle integration test to walk through Validation(Pending -> InProgress -> Partial -> Validated) -> Ready.

  • Deleted test_rack_state_transition_validation: the test manually wrote every possible RackState variant to the DB and read it back, making it effectively a DB serialization round-trip test with no state-machine involvement.

  • Replaced test_can_retrieve_rack_state_history with the new test_can_retrieve_rack_state_history_with_real_handler: it uses RackStateHandler, drives the rack through the full lifecycle via actual DB state, and asserts on the resulting state at every step.

  • Deleted test_rack_error_state_handling: asserted count > 0 on the fake handler rather than verifying the rack stayed in Error. The real behavior is now tested in test_error_state_does_nothing_with_controller using RackStateHandler directly.

  • Deleted test_rack_state_transitions: the test verified only that the controller infrastructure called TestRackStateHandler at least once per rack ID. No assertions about which state transitions occurred or whether the handler logic was correct.
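The two label helpers described in the list above can be sketched roughly as follows. This is a hedged illustration only: the Machine struct here is a minimal stand-in for the real api-model type, and the signatures are simplified (the actual handlers operate on DB records inside a transaction).

```rust
use std::collections::HashMap;

// Minimal stand-in for a machine record (hypothetical shape; the real
// type carries far more than labels).
struct Machine {
    labels: HashMap<String, String>,
}

// Scan compute-tray machines for the `rv.run-id` label set by RVS;
// the first hit is what drives Pending -> InProgress.
fn find_rv_run_id(machines: &[Machine]) -> Option<String> {
    machines
        .iter()
        .find_map(|m| m.labels.get("rv.run-id").cloned())
}

// Strip all `rv.*` labels so the next RVS cycle starts with a clean slate.
fn strip_rv_labels(machine: &mut Machine) {
    machine.labels.retain(|k, _| !k.starts_with("rv."));
}

fn main() {
    let mut tray = Machine {
        labels: HashMap::from([
            ("rv.run-id".to_string(), "test-run".to_string()),
            ("rv.phase".to_string(), "partial".to_string()),
            ("site".to_string(), "a1".to_string()),
        ]),
    };
    assert_eq!(
        find_rv_run_id(std::slice::from_ref(&tray)),
        Some("test-run".to_string())
    );
    strip_rv_labels(&mut tray);
    assert_eq!(tray.labels.len(), 1);
    assert!(tray.labels.contains_key("site"));
    println!("ok");
}
```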


Copilot AI review requested due to automatic review settings March 30, 2026 07:59
@maxo-nv maxo-nv force-pushed the feature/better_rack_integration_tests branch from 53a14e3 to 685ddbd Compare March 30, 2026 07:59
@maxo-nv maxo-nv requested a review from a team as a code owner March 30, 2026 07:59
@github-actions

🔐 TruffleHog Secret Scan: no secrets or credentials found (commit 685ddbd, 2026-03-30 08:01:16 UTC).

Copilot AI left a comment:
Pull request overview

This PR updates the rack validation flow to correlate state-machine transitions with live RVS machine metadata by moving run_id tracking into RackValidationState, and refactors rack lifecycle integration tests to exercise the real RackStateHandler against a DB-backed environment.

Changes:

  • Move validation run_id from RackConfig to non-Pending RackValidationState variants; derive run_id from observed rv.run-id machine labels.
  • Add RV label lifecycle helpers: scan compute trays for rv.run-id to start validation, and clear rv.* labels on Maintenance(Completed) to avoid stale metadata.
  • Refactor/replace rack controller integration tests to use RackStateHandler and extend TestRackDbBuilder to populate compute trays/power shelves.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Summary per file:

  • crates/api-model/src/rack.rs: moves run_id into validation substates; removes validation_run_id from RackConfig.
  • crates/api/src/state_controller/rack/handler.rs: adds find_rv_run_id + clear_rv_labels; wires validation gating/filtering to the run_id carried in substates.
  • crates/api/src/state_controller/rack/rv.rs: adds strip_rv_labels; makes partition aggregation require a run-id filter.
  • crates/api/src/state_controller/rack/io.rs: updates state-to-metric mapping for the new RackValidationState shapes.
  • crates/api-db/src/rack.rs: updates rack creation defaults to match the updated RackConfig.
  • crates/api/src/tests/common/api_fixtures/site_explorer.rs: extends TestRackDbBuilder to populate compute trays and power shelves in RackConfig.
  • crates/api/src/tests/rack_state_controller/mod.rs: replaces the stub-driven lifecycle test with a real-handler-driven lifecycle test; updates the error-state test.
  • crates/api/src/tests/rack_state_controller/handler.rs: exposes config/id helpers for reuse; removes validation_run_id from test configs.
  • crates/api/src/tests/rack_health.rs: removes validation_run_id from rack config setup in health tests.


Comment on lines +401 to +402
"{\"state\": \"validation\", \"rack_validation\": {\"Validated\": {\"run_id\": \"test-run\"}}}",
"{\"state\": \"ready\"}",
Copilot AI, Mar 30, 2026:
This comment is now incorrect: in this test the run_id comes from the real RackStateHandler observing rv.run-id machine labels, not from TestRackStateHandler. Please update/remove the comment to avoid confusion when reading the test expectations.

maxo-nv (Contributor, Author) replied:
maybe I'm blind, but I don't see what you are referring to

@Matthias247 (Contributor):

Thanks for adding such a long description! Based on that, there might be some conflicts with #736 which also touches rack states and improves the state of the unit-tests. We will need to land them one by one. I'll look at the code in detail separately.

rack: &Rack,
ctx: &mut StateHandlerContext<'_, RackStateHandlerContextObjects>,
) -> Result<(), StateHandlerError> {
let machine_ids = &rack.config.compute_trays;
Contributor:
It's probably better to use db::machine::find_machine_ids where the filter (search_config) carries the rack IDs. That gives you just the ingested racks. The config thing will go away.

same applies for the other functions here

maxo-nv (Contributor, Author) replied:
Done,

/me clearly misunderstands the notion of "ingested rack"

.labels
.keys()
.filter(|k| k.starts_with("rv."))
.cloned()
Contributor:
you could collect a Vec<&String> or Vec<&str> and avoid the .cloned(). It's a bit more efficient since it doesn't require the allocations

maxo-nv (Contributor, Author) replied:
done: simplified even further
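For readers skimming the thread, the allocation difference being discussed looks roughly like this. It's a standalone sketch over a local map (the real code filters machine metadata labels); the helper names here are illustrative only:

```rust
use std::collections::HashMap;

// Owning version: .cloned() allocates a fresh String per matching key.
fn rv_keys_owned(labels: &HashMap<String, String>) -> Vec<String> {
    labels
        .keys()
        .filter(|k| k.starts_with("rv."))
        .cloned()
        .collect()
}

// Borrowing version: collects &str views into the map, avoiding the
// per-key allocation; the result is tied to the borrow of `labels`.
fn rv_keys_borrowed(labels: &HashMap<String, String>) -> Vec<&str> {
    labels
        .keys()
        .filter(|k| k.starts_with("rv."))
        .map(String::as_str)
        .collect()
}

fn main() {
    let labels = HashMap::from([
        ("rv.run-id".to_string(), "r1".to_string()),
        ("os".to_string(), "linux".to_string()),
    ]);
    assert_eq!(rv_keys_owned(&labels), vec!["rv.run-id".to_string()]);
    assert_eq!(rv_keys_borrowed(&labels), vec!["rv.run-id"]);
    println!("ok");
}
```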

// machine in compute_trays and requires ManagedHostState::Ready. We insert
// machine records directly, bypassing the full machine state machine since
// this test exercises the rack SM only.
db_machine::create(
Contributor:
we should use create_managed_host with a config that ties it to the rack. That ensures that the machine actually goes through the right and valid lifecycle states.

maxo-nv (Contributor, Author) replied:
at first I thought you referred to MachineCreator::create_managed_host() and it sent me down a rabbit hole...

Now I'm realizing that you were talking about TestEnv::create_managed_host right?

In which case it looks like another rabbit hole for me, albeit considerably smaller. In that hole, at least around the entrance, I see that I need to wire up the rack SM controller similar to what is already done for other controllers:

struct TestEnv {
// ...
    machine_state_controller: Arc<Mutex<StateController<MachineStateControllerIO>>>,
    spdm_state_controller: Arc<Mutex<StateController<SpdmStateControllerIO>>>,
    pub machine_state_handler: SwapHandler<MachineStateHandler>,
    network_segment_controller: Arc<Mutex<StateController<NetworkSegmentStateControllerIO>>>,
    ib_partition_controller: Arc<Mutex<StateController<IBPartitionStateControllerIO>>>,
    power_shelf_controller: Arc<Mutex<StateController<PowerShelfStateControllerIO>>>,
    switch_controller: Arc<Mutex<StateController<SwitchStateControllerIO>>>,
// ...
}

but... am I on the right path? Is using TestEnv the form of Rack SM testing we need? What would be a nice example of using it? I'm confused about what I should copy :)

Contributor:
You shouldn't need to touch any of TestEnv, I think Matthias is just saying you can do something like:

    let host_1 = create_managed_host_with_config(
        &env,
        ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
    )
    .await
    .host()
    .db_machine(&mut txn)
    .await;
    let host_2 = create_managed_host_with_config(
        &env,
        ManagedHostConfig::with_expected_machine_data(rack_data.clone()),
    )
    .await
    .host()
    .db_machine(&mut txn)
    .await;

Which will create each machine and give you back the Machine entry from the database. Then you can create the rack with the MAC addresses from the hosts as expected_compute_trays, and their ids as with_compute_trays, i.e.

TestRackDbBuilder::new()
    .with_rack_id(rack_id.clone())
    .with_expected_compute_trays(vec![
        host_1.bmc_info.mac.unwrap().bytes(),
        host_2.bmc_info.mac.unwrap().bytes(),
    ])
    .with_expected_power_shelves(vec![])
    .with_expected_switches(vec![])
    .with_compute_trays(vec![host_1.id, host_2.id])
    .with_rack_type("Simple")
    .persist(&mut txn)
    .await?;

Contributor:
You are both right :)

TestEnv at the moment does not have a reference to rack_state_controller, so we can't really do rack_state_controller.run_single_iteration() or anything which depends on it. So yes, it should be added.

But then there's also the question on how the Machine is created, and whether it automatically creates the Rack on the way. And Ken's first code snippet would do that - but it would require the expected-rack to be also stored first.

I think @vinodchitraliNVIDIA's change has the code to automatically create a rack object if it's referenced by a Machine and referenced in expected-machines. So maybe we should just wait until that is in and then rework all of these tests.

@maxo-nv (Contributor, Author) commented Apr 1, 2026

Thanks for adding such a long description!

You're welcome! I guess it's just my mediocre creativity trying to break through commit messages, hence all this verbosity.

Based on that, there might be some conflicts with #736 which also touches rack states and improves the state of the unit-tests. We will need to land them one by one. I'll look at the code in detail separately.

Agree. I'm happy to yield the way to #736 and fix merge conflicts on my side.

} => {
match rack_firmware_upgrade {
RackFirmwareUpgradeState::Compute => {
// TODO[#416]: Implement compute firmware upgrade
Contributor:
Why drop these issue numbers? I actually think it's a great idea for TODO comments to always correspond to an issue number. (In a past life we had a lint that ensured "no TODO comments without a bug ID" because otherwise TODO comments just languish and never get addressed.)
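A lint like the one described above can be approximated with a grep pipeline. This is a hypothetical sketch (the TODO[#123] pattern simply mirrors the convention visible in this diff), and the demo runs on an inline sample rather than the source tree:

```shell
# Count TODO comments that lack an issue reference like TODO[#416].
# In CI you would pipe `grep -rn TODO crates/` in instead of the sample.
printf '%s\n' \
  '// TODO[#416]: Implement compute firmware upgrade' \
  '// TODO: fix this later' \
  | grep -E 'TODO' \
  | grep -vcE 'TODO\[#[0-9]+\]'
```

This prints 1 (the one unreferenced TODO); a CI job would fail whenever the count is non-zero.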

maxo-nv (Contributor, Author) replied:
these parts are all tracked in Jira, and we don't have corresponding GH issue numbers. Let me talk to @vinodchitraliNVIDIA about that

self
}

#[allow(dead_code)]
Contributor:
Please delete this function (or comment it out if you want to leave a hint for future work), or if you want, just add a .with_power_shelves(vec![]) in one of the tests so it's not dead code.

maxo-nv (Contributor, Author) replied:
crap, this fn is actually used; my bad, I left dead_code there

@Matthias247 (Contributor) left a comment:
Unit-tests should get some updates, but we can coordinate with Vinod's changes.

@Matthias247 (Contributor):

@maxo-nv all the conflicting branches have been merged. Can you rebase on top of them?

@maxo-nv maxo-nv force-pushed the feature/better_rack_integration_tests branch from 3facbc3 to 86113ef Compare April 7, 2026 07:28