Skip to content

feat: add synchronous Redfish POST/PATCH proxy for the Carbide-DPS agent#771

Open
spydaNVIDIA wants to merge 6 commits intoNVIDIA:mainfrom
spydaNVIDIA:pyda_dps
Open

feat: add synchronous Redfish POST/PATCH proxy for the Carbide-DPS agent#771
spydaNVIDIA wants to merge 6 commits intoNVIDIA:mainfrom
spydaNVIDIA:pyda_dps

Conversation

@spydaNVIDIA
Copy link
Copy Markdown
Contributor

Description

The Carbide-DPS agent must authenticate via mTLS using a SPIFFE X.509 certificate with the service identity carbide-dps-agent. Carbide authorizes this identity through both internal RBAC rules and Casbin policy to call RedfishBrowse, RedfishPost, and RedfishPatch.

Allow the Carbide-DPS agent to perform Redfish GET, POST, and PATCH operations on BMCs through Carbide's gRPC API.

POST and PATCH requests are subject to per-principal URI allowlists configured in carbide-api-config.toml, restricting which Redfish endpoints a given service principal can target. External (certificate-based) users bypass the allowlist since the RBAC layer already gates their access.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@spydaNVIDIA spydaNVIDIA self-assigned this Apr 1, 2026
@spydaNVIDIA spydaNVIDIA requested a review from a team as a code owner April 1, 2026 17:25
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-01 17:27:53 UTC | Commit: 241bc5e

Copy link
Copy Markdown
Contributor

@Matthias247 Matthias247 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the distinction between the first-level check (carbide-dps-agent) and what goes into the config file. Does a caller need to have both roles? First carbide-dps-agent to access and then what is in the config file? Or can carbide-dps-agent access everything, but things in the config file can also access specific path. If its the latter, then I'm wondering if we can just go for 1 of the 2 access patterns, so that things are easier to understand.

@spydaNVIDIA
Copy link
Copy Markdown
Contributor Author

I don't fully understand the distinction between the first-level check (carbide-dps-agent) and what goes into the config file. Does a caller need to have both roles? First carbide-dps-agent to access and then what is in the config file? Or can carbide-dps-agent access everything, but things in the config file can also access specific path. If its the latter, then I'm wondering if we can just go for 1 of the 2 access patterns, so that things are easier to understand.

The former. We need the first level check to enable DPS agent to talk to the specific gRPC endpoints (RedfishPost etc). Although it is OK for the agent to get full access to the GET proxy, we do not want the DPS agent to have admin level access to all POST/PATCH actions against BMCs managed by carbide. So, we introduce the config file to define the POST & PATCH methods that the DPS agent is allowed to hit.

Here is an example from forged:
https://gitlab-master.nvidia.com/nvmetal/forged/-/merge_requests/5168

[redfish_proxy.carbide-dps-agent] allowed_post_uris = [ "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.EnableProfiles", "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.DisableProfiles", "/redfish/v1/Managers/BMC/NodeManager/Domains", ] allowed_patch_uris = [ "/redfish/v1/Managers/BMC/NodeManager/Domains/{id}", ]

@Matthias247
Copy link
Copy Markdown
Contributor

The former. We need the first level check to enable DPS agent to talk to the specific gRPC endpoints (RedfishPost etc). Although it is OK for the agent to get full access to the GET proxy, we do not want the DPS agent to have admin level access to all POST/PATCH actions against BMCs managed by carbide. So, we introduce the config file to define the POST & PATCH methods that the DPS agent is allowed to hit.

Thanks for the explanation! Totally agree we want to limit the callable methods. The only thing which I find surprising with this setup is that the internal_rbac_rules only allow a single caller the methods. And if that's true, then the per-caller permissions in the config file would not apply. Should the internal rules just allow any caller, since the actual checks happen via config file?

@spydaNVIDIA
Copy link
Copy Markdown
Contributor Author

The former. We need the first level check to enable DPS agent to talk to the specific gRPC endpoints (RedfishPost etc). Although it is OK for the agent to get full access to the GET proxy, we do not want the DPS agent to have admin level access to all POST/PATCH actions against BMCs managed by carbide. So, we introduce the config file to define the POST & PATCH methods that the DPS agent is allowed to hit.

Thanks for the explanation! Totally agree we want to limit the callable methods. The only thing which I find surprising with this setup is that the internal_rbac_rules only allow a single caller the methods. And if that's true, then the per-caller permissions in the config file would not apply. Should the internal rules just allow any caller, since the actual checks happen via config file?

The internal_rbac_rules allow both ForgeAdminCli and CarbideDpsAgent access to these RPCs:

x.perm("RedfishBrowse", vec![ForgeAdminCLI, CarbideDpsAgent]); x.perm("RedfishPost", vec![ForgeAdminCLI, CarbideDpsAgent]); x.perm("RedfishPatch", vec![ForgeAdminCLI, CarbideDpsAgent]);

The allowlist exists to constrain automated service-to-service callers (like DPS) that should only touch specific BMC endpoints. External users (like ForgeAdminCLI) bypass the URI allowlist. I was thinking in the future, we may have other services that need proxy access to a different set of actions.

@yoks
Copy link
Copy Markdown
Contributor

yoks commented Apr 1, 2026

I feel we should not let any other service (a part from NICo itself), to do any mutation of any resource under it controll via direct proxy.
This goes against consistent state management and FSM design.

@spydaNVIDIA
Copy link
Copy Markdown
Contributor Author

I feel we should not let any other service (a part from NICo itself), to do any mutation of any resource under it controll via direct proxy. This goes against consistent state management and FSM design.

Linked the design in a PM. @yoks please lmk if it looks OK after reviewing

@spydaNVIDIA spydaNVIDIA force-pushed the pyda_dps branch 2 times, most recently from 47878e1 to 0b50fd2 Compare April 2, 2026 05:43
@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 2, 2026

With this design we will never be able organically integrate external calls into machine management infrastructure because "what is alllowed" is just piece of configuration.

For example, somebody can decide to expose power management interface for external services and nothing prevent one to do this. And carbide doesn't have a single chance to handle it correctly without additional level of patches.

This is specific concern that I raised in discussion of upstream project task #282

@zhaozhongn
Copy link
Copy Markdown

With this design we will never be able organically integrate external calls into machine management infrastructure because "what is alllowed" is just piece of configuration.

For example, somebody can decide to expose power management interface for external services and nothing prevent one to do this. And carbide doesn't have a single chance to handle it correctly without additional level of patches.

This is specific concern that I raised in discussion of upstream project task #282

That's correct. With the current implementation, software integrators (taking NICo and power provisioning software as components to build their solutions) will need to take responsibility of making sure their configs to be sane. But this is the same for other configs as well fundamentally. And I tend to believe the current exposed power ceiling setting interface should be "safe" from the state machine perspective.

Eventually, I hope we will be able to figure out a long-term solution to the general problem as you described.

@yoks
Copy link
Copy Markdown
Contributor

yoks commented Apr 2, 2026

I mean if it just specific API subset, what stops us exposing them in Carbide and using them? This helps us in future to wire them via state management.

@spydaNVIDIA spydaNVIDIA requested a review from Matthias247 April 2, 2026 18:37
@zhaozhongn
Copy link
Copy Markdown

I mean if it just specific API subset, what stops us exposing them in Carbide and using them? This helps us in future to wire them via state management.

I think we are doing that. Just via config not code?

@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 2, 2026

That's correct. With the current implementation, software integrators (taking NICo and power provisioning software as components to build their solutions) will need to take responsibility of making sure their configs to be sane. But this is the same for other configs as well fundamentally. And I tend to believe the current exposed power ceiling setting interface should be "safe" from the state machine perspective.

Eventually, I hope we will be able to figure out a long-term solution to the general problem as you described.

I trust you that settings you are going to expose are safe. This is not the matter of cocnern and also not a part of this PR. I'm discussing specifically idea of direct exposure of Redfish interface through GRPC API of Carbide. I believe that once we provide this API we cannot control who and how will use this API. Code doesn't provide any guardrails to parts of Redfish tree that can interfere into carbide logic and therefore anybody can write cool tool that reboots server when state machine doesn't expect it.

This is injection ultimate source of inconsistency in machine state management.

@spydaNVIDIA
Copy link
Copy Markdown
Contributor Author

That's correct. With the current implementation, software integrators (taking NICo and power provisioning software as components to build their solutions) will need to take responsibility of making sure their configs to be sane. But this is the same for other configs as well fundamentally. And I tend to believe the current exposed power ceiling setting interface should be "safe" from the state machine perspective.
Eventually, I hope we will be able to figure out a long-term solution to the general problem as you described.

I trust you that settings you are going to expose are safe. This is not the matter of cocnern and also not a part of this PR. I'm discussing specifically idea of direct exposure of Redfish interface through GRPC API of Carbide. I believe that once we provide this API we cannot control who and how will use this API. Code doesn't provide any guardrails to parts of Redfish tree that can interfere into carbide logic and therefore anybody can write cool tool that reboots server when state machine doesn't expect it.

This is injection ultimate source of inconsistency in machine state management.

Doesnt the config provide the guardrail that you are detailing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) to issue the POST to reboot servers via the config.

@zhaozhongn
Copy link
Copy Markdown

That's correct. With the current implementation, software integrators (taking NICo and power provisioning software as components to build their solutions) will need to take responsibility of making sure their configs to be sane. But this is the same for other configs as well fundamentally. And I tend to believe the current exposed power ceiling setting interface should be "safe" from the state machine perspective.
Eventually, I hope we will be able to figure out a long-term solution to the general problem as you described.

I trust you that settings you are going to expose are safe. This is not the matter of cocnern and also not a part of this PR. I'm discussing specifically idea of direct exposure of Redfish interface through GRPC API of Carbide. I believe that once we provide this API we cannot control who and how will use this API. Code doesn't provide any guardrails to parts of Redfish tree that can interfere into carbide logic and therefore anybody can write cool tool that reboots server when state machine doesn't expect it.
This is injection ultimate source of inconsistency in machine state management.

Doesnt the config provide the guardrail that you are detailing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) to issue the POST to reboot servers via the config.

This is what I meant by using "config" than "code" to achieve the goal. This is OSS, so anyone can change the code as easily as the config in their deployment.

@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 2, 2026

Doesnt the config provide the guardrail that you are detailing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) to issue the POST to reboot servers via the config.

Software integrator may not know what endpoints are safe to enable and what endpoints are not. And opposite, we as carbide developers don't know which endpoints will be exposed by software integrators. And two issues with this:

  1. For integrator it may work but randomly fail because of state machine interference.
  2. For us, we don't know what invariant we are breaking by changing our code. Lets say somebody decided that exposing BIOS configuration is safe and use it. After some time we decide to change some parameters of BIOS inside carbide. And we have no clue which carbide deployment out there we are breaking by this action.

@zhaozhongn
Copy link
Copy Markdown

Doesnt the config provide the guardrail that you are detailing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) to issue the POST to reboot servers via the config.

Software integrator may not know what endpoints are safe to enable and what endpoints are not. And opposite, we as carbide developers don't know which endpoints will be exposed by software integrators. And two issues with this:

  1. For integrator it may work but randomly fail because of state machine interference.
  2. For us, we don't know what invariant we are breaking by changing our code. Lets say somebody decided that exposing BIOS configuration is safe and use it. After some time we decide to change some parameters of BIOS inside carbide. And we have no clue which carbide deployment out there we are breaking by this action.

I do not disagree with these points. However, this is OSS software we are talking about. There will be many things we cannot 100% assume as people can change code and config equally easily. And from an integration point of view, they should need to do some changes. This is by no means safe.

@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 2, 2026

Doesnt the config provide the guardrail that you are detailing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) to issue the POST to reboot servers via the config.

This is what I meant by using "config" than "code" to achieve the goal. This is OSS, so anyone can change the code as easily as the config in their deployment.

This is not the best argument to expose Raw API that breaks our claims about system state. To me it looks similar to exposing raw SQL interface to DB. It will be killer feature for software integrator and nightmare for long-term support.

@zhaozhongn
Copy link
Copy Markdown

Doesnt the config provide the guardrail that you are detailing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) to issue the POST to reboot servers via the config.

This is what I meant by using "config" than "code" to achieve the goal. This is OSS, so anyone can change the code as easily as the config in their deployment.

This is not the best argument to expose Raw API that breaks our claims about system state. To me it looks similar to exposing raw SQL interface to DB. It will be killer feature for software integrator and nightmare for long-term support.

I disagree. We exposed an API with a config that's safe. We support that. Any other usage is beyond our control and we will not "support" it. Could it be harder for the integrator? Yes, but that's the cost of this advanced features. The balancing and trade-off calculation will be up to them.

@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 2, 2026

This is not the best argument to expose Raw API that breaks our claims about system state. To me it looks similar to exposing raw SQL interface to DB. It will be killer feature for software integrator and nightmare for long-term support.

I disagree. We exposed an API with a config that's safe. We support that. Any other usage is beyond our control and we will not "support" it. Could it be harder for the integrator? Yes, but that's the cost of this advanced features. The balancing and trade-off calculation will be up to them.

With config it is as raw SQL with regexp validator of allowed expressions in config. But I think that all arguments are on the table. Somebody will decide what direction carbide should take.

poroh
poroh previously requested changes Apr 2, 2026
Copy link
Copy Markdown
Contributor

@poroh poroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments in conversations.

@zhaozhongn
Copy link
Copy Markdown

This is not the best argument to expose Raw API that breaks our claims about system state. To me it looks similar to exposing raw SQL interface to DB. It will be killer feature for software integrator and nightmare for long-term support.

I disagree. We exposed an API with a config that's safe. We support that. Any other usage is beyond our control and we will not "support" it. Could it be harder for the integrator? Yes, but that's the cost of this advanced features. The balancing and trade-off calculation will be up to them.

With config it is as raw SQL with regexp validator of allowed expressions in config. But I think that all arguments are on the table. Somebody will decide what direction carbide should take.

I agree with all the technical concerns. I think we might be able to solve all these elegantly in the future. But we could end up not being able to. Regardless, for now we need to provide these functionalities for very practical reasons. And I do not see a obviously better alternative. Hence the decision to go this way.

@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 2, 2026

I agree with all the technical concerns. I think we might be able to solve all these elegantly in the future. But we could end up not being able to. Regardless, for now we need to provide these functionalities for very practical reasons. And I do not see a obviously better alternative. Hence the decision to go this way.

Obvious alternative is to expose specific RPC command(s) to access those endpoints you need for carbide DPS. It can be modelled as enum:

enum RedfishCommand {
UpdateNodeManagerDomains(...),
NvidiaWorkloadProfileAction(...)
}

You can easy extend this command in future but still keep full control over what is allowed to be done via API.

@zhaozhongn
Copy link
Copy Markdown

I agree with all the technical concerns. I think we might be able to solve all these elegantly in the future. But we could end up not being able to. Regardless, for now we need to provide these functionalities for very practical reasons. And I do not see a obviously better alternative. Hence the decision to go this way.

Obvious alternative is to expose specific RPC command(s) to access those endpoints you need for carbide DPS. It can be modelled as enum:

enum RedfishCommand { UpdateNodeManagerDomains(...), NvidiaWorkloadProfileAction(...) }

You can easy extend this command in future but still keep full control over what is allowed to be done via API.

For that, surely. I have talked to @spydaNVIDIA and he will switch from using string value to an enum value in gRPC interface.

@kensimon
Copy link
Copy Markdown
Contributor

kensimon commented Apr 6, 2026

Obvious alternative is to expose specific RPC command(s) to access those endpoints you need for carbide DPS. It can be modelled as enum:

enum RedfishCommand { UpdateNodeManagerDomains(...), NvidiaWorkloadProfileAction(...) }

There's a lot of discussion I haven't read all of, but I totally agree here. We shouldn't be exposing an arbitrary redfish proxy for POST/PUT/etc requests, but instead expressing the logical operations via our own API and letting callers call that directly (hiding redfish as an implementation detail.)

The PR as it is now places a ton of importance on creating a sufficiently secure config file with a redfish action allowlist, and I think the software ought to be more opinionated than that.

@spydaNVIDIA spydaNVIDIA force-pushed the pyda_dps branch 2 times, most recently from af4340c to c06a12a Compare April 7, 2026 22:18
@spydaNVIDIA
Copy link
Copy Markdown
Contributor Author

@poroh @kensimon I updated the MR to reflect the above discussion.

@spydaNVIDIA spydaNVIDIA requested a review from poroh April 7, 2026 22:20
@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 7, 2026

@poroh @kensimon I updated the MR to reflect the above discussion.

This change is not what I meant by changing the API. And with this change, it breaks abstractions because the enum contains specific URIs for H100/GB200/GB300. Carbide provides machines as an abstraction to users of its interface. So instead of giving users proxy access to BMC, Carbide should provide a high-level interface. In the case of DPS, it is:

  1. Access to the latest power sensor readings (should be plumbed from the health service)
  2. The ability to set up power profiles and power limits for CPU/GPU

It should not provide an interface for Redfish exploration of each individual machine.

In addition, the access footprint for sensor readings also matters. In the case of the "Redfish proxy" approach, it would require DB/Vault access for each individual sensor read.

@zhaozhongn
Copy link
Copy Markdown

zhaozhongn commented Apr 8, 2026

@poroh @kensimon I updated the MR to reflect the above discussion.

This change is not what I meant by changing the API. And with this change, it breaks abstractions because the enum contains specific URIs for H100/GB200/GB300. Carbide provides machines as an abstraction to users of its interface. So instead of giving users proxy access to BMC, Carbide should provide a high-level interface. In the case of DPS, it is:

  1. Access to the latest power sensor readings (should be plumbed from the health service)
  2. The ability to set up power profiles and power limits for CPU/GPU

It should not provide an interface for Redfish exploration of each individual machine.

In addition, the access footprint for sensor readings also matters. In the case of the "Redfish proxy" approach, it would require DB/Vault access for each individual sensor read.

I do not think we can and should provide such APIs in the way you suggested, which requires NICo to have complete and precise understanding of the power provisioning details of every single HW models from BMC (and NVUE for switch in the future). NICo are not good at this today and cannot be truly good at this tomorrow even if we wanted to. So, before someone can work out a general and validated power provisioning abstract API that works for all current and future HW components, we should not stretch ourselves and should instead be honest and simply proxy the REST BMC (and NVUE for switch) power provisioning APIs.

What we are able to do today is to know that these particular BMC (and NVUE) REST write endpoints will behave and not conflict with NICo's other write operations. And use enum to list and guard against them. I believe this PR did this.

@zhaozhongn zhaozhongn dismissed poroh’s stale review April 8, 2026 00:32

See my comments

@poroh poroh removed their request for review April 8, 2026 00:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants