Skip to content

refactor: split machine state handler into per-state modules#725

Open
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:organizing-the-state-machine
Open

refactor: split machine state handler into per-state modules#725
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:organizing-the-state-machine

Conversation

@chet
Copy link
Copy Markdown
Contributor

@chet chet commented Mar 25, 2026

Description

Sooo, I'm not sure how we feel about this, but it's come up before, and it's been in the back of my mind, and after restructuring the entire admin-cli crate, I wanted to poke at this and see what the vibe was.

If we hate it, I can close it!

In this case the machine state handler was over 10k lines long -- it was/is a single file with the entire machine lifestyle all in ther (all four sub-handlers, firmware upgrade logic, shared functions, etc). In a sense, it's easy to grep through, because you can just jump around the file, but at the same time, it has also grown so much that it has become hard to look at as well. It's growth is mainly a testament to its success, fwiw.

So, this refactor splits it into a module-per-state structure. The idea is that the module path tells you exactly what state you're looking at (which I think will help with discovery), e.g.

managed_host_state::verify_rms_membership::handle(...)
managed_host_state::waiting_for_cleanup::handle(...)
managed_host_state::failed::handle(...)

Each state enum gets its own directory, each variant gets its own file with a .handle() function, and handler.rs is now a < 1k line "dispatch" hub that reads like a table of contents.

New structure:

    handler.rs
    handler/
      managed_host_state/              ManagedHostState::
      dpu_machine_state/               DpuDiscoveringState + DpuInitState
      host_machine_state/              MachineState::*
      instance_state/                  InstanceState::*
      host_reprovision_state/          HostReprovisionState::*
      common.rs                        Shared utility functions.

There were zero logic changes with this -- it's purely a structural refactor. All ~1100 tests still pass as-is; I set out with the intent that all existing tests should continue to pass to consider success.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

No testing required on purpose. The intent was that all ~1100 existing tests should pass.

Sooo, I'm not sure how we feel about this, but it's come up before, and it's been in the back of my mind, and after restructuring the entire `admin-cli` crate, I wanted to poke at this and see what the vibe was.

If we hate it, I can close it!

In this case the machine state handler was over 10k lines long -- it was/is a single file with the entire machine lifestyle all in ther (all four sub-handlers, firmware upgrade logic, shared functions, etc). In a sense, it's *easy* to grep through, because you can just jump around the file, but at the same time, it has also grown so much that it has become hard to look at as well. It's growth is mainly a testament to its success, fwiw.

So, this refactor splits it into a module-per-state structure. The idea is that the module path tells you exactly what state you're looking at (which I think will help with discovery), e.g.

```
managed_host_state::verify_rms_membership::handle(...)
managed_host_state::waiting_for_cleanup::handle(...)
managed_host_state::failed::handle(...)
```

Each state enum gets its own directory, each variant gets its own file with a `.handle()` function, and `handler.rs` is now a < 1k line "dispatch" hub that reads like a table of contents.

New structure:
```
    handler.rs
    handler/
      managed_host_state/              ManagedHostState::
      dpu_machine_state/               DpuDiscoveringState + DpuInitState
      host_machine_state/              MachineState::*
      instance_state/                  InstanceState::*
      host_reprovision_state/          HostReprovisionState::*
      common.rs                        Shared utility functions.
```

There were **zero logic changes** with this -- it's purely a structural refactor. All ~1100 tests still pass as-is; I set out with the intent that all existing tests should continue to pass to consider success.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet requested a review from a team as a code owner March 25, 2026 22:16
@chet chet requested a review from Matthias247 March 25, 2026 22:17
@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-25 22:19:26 UTC | Commit: dc973d4

@wminckler
Copy link
Copy Markdown
Contributor

I never really thought about the file path being related to a state. typically we've been breaking it up by component, like machine validation states are all in machine_validation.rs

this is going to blow peoples minds

@chet
Copy link
Copy Markdown
Contributor Author

chet commented Mar 30, 2026

I never really thought about the file path being related to a state. typically we've been breaking it up by component, like machine validation states are all in machine_validation.rs

this is going to blow peoples minds

Lol @wminckler. I'm not sure if that's meant as a good or bad thing. I'm also not sure how people feel about this in general. CamelCase <-> snake_case is a natural mapping for me so this seems like a sane way to organize, if we wanted to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants