feat: add synchronous Redfish POST/PATCH proxy for the Carbide-DPS agent by spydaNVIDIA · Pull Request #771 · NVIDIA/ncx-infra-controller-core

spydaNVIDIA · 2026-04-01T17:25:52Z

Description

The Carbide-DPS agent must authenticate via mTLS using a SPIFFE X.509 certificate with the service identity carbide-dps-agent. Carbide authorizes this identity through both internal RBAC rules and Casbin policy to call RedfishBrowse, RedfishPost, and RedfishPatch.

Allow the Carbide-DPS agent to perform Redfish GET, POST, and PATCH operations on BMCs through Carbide's gRPC API.

POST and PATCH requests are subject to per-principal URI allowlists configured in carbide-api-config.toml, restricting which Redfish endpoints a given service principal can target. External (certificate-based) users bypass the allowlist since the RBAC layer already gates their access.

github-actions · 2026-04-01T17:27:54Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-04-01 17:27:53 UTC | Commit: 241bc5e

crates/rpc/proto/forge.proto

Matthias247

I don't fully understand the distinction between the first-level check (carbide-dps-agent) and what goes into the config file. Does a caller need to have both roles: first carbide-dps-agent to get access, and then whatever is in the config file? Or can carbide-dps-agent access everything, while principals in the config file can also access specific paths? If it's the latter, then I'm wondering if we can just go with one of the two access patterns, so that things are easier to understand.

crates/api/src/auth/internal_rbac_rules.rs

crates/api/src/cfg/file.rs

crates/api/casbin-policy.csv

spydaNVIDIA · 2026-04-01T18:40:33Z

The former. We need the first-level check to enable the DPS agent to talk to the specific gRPC endpoints (RedfishPost etc.). Although it is OK for the agent to get full access to the GET proxy, we do not want the DPS agent to have admin-level access to all POST/PATCH actions against BMCs managed by Carbide. So, we introduce the config file to define the POST and PATCH URIs that the DPS agent is allowed to hit.

Here is an example from forged:
https://gitlab-master.nvidia.com/nvmetal/forged/-/merge_requests/5168

[redfish_proxy.carbide-dps-agent]
allowed_post_uris = [
    "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.EnableProfiles",
    "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.DisableProfiles",
    "/redfish/v1/Managers/BMC/NodeManager/Domains",
]
allowed_patch_uris = [
    "/redfish/v1/Managers/BMC/NodeManager/Domains/{id}",
]
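
The `{id}` placeholders in the allowlist suggest a template match rather than a literal string comparison. The following is a minimal sketch of how such a match could work, assuming each `{id}` stands for one or more characters other than `/`. The function names `matches_template` and `is_allowed` are hypothetical and are not taken from the PR.

```rust
/// Returns true when `uri` matches `template`. Each `{id}` in the template is
/// assumed to stand for one or more characters other than '/'.
fn matches_template(template: &str, uri: &str) -> bool {
    let parts: Vec<&str> = template.split("{id}").collect();
    match_parts(&parts, uri)
}

/// Matches the literal `parts` in order, with one placeholder between each
/// pair of neighbouring parts. `parts` is never empty because `str::split`
/// always yields at least one item.
fn match_parts(parts: &[&str], rest: &str) -> bool {
    let Some(rest) = rest.strip_prefix(parts[0]) else {
        return false;
    };
    if parts.len() == 1 {
        // No placeholder left: the whole URI must have been consumed.
        return rest.is_empty();
    }
    // Try every non-empty placeholder value that stays within one path segment.
    for (i, ch) in rest.char_indices() {
        if ch == '/' {
            break;
        }
        let end = i + ch.len_utf8();
        if match_parts(&parts[1..], &rest[end..]) {
            return true;
        }
    }
    false
}

/// Returns true when any template in `allowlist` matches `uri`.
fn is_allowed(allowlist: &[&str], uri: &str) -> bool {
    allowlist.iter().any(|template| matches_template(template, uri))
}
```

With this sketch, `/redfish/v1/Managers/BMC/NodeManager/Domains/7` matches the PATCH template, while `/redfish/v1/Managers/BMC/NodeManager/Domains/7/Extra` does not. The sketch does not normalize the URI, so a real check would probably also need to reject values such as `..`, query strings, and percent-encoded separators before matching.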

Matthias247 · 2026-04-01T20:48:23Z

Thanks for the explanation! Totally agree we want to limit the callable methods. The only thing I find surprising with this setup is that the internal_rbac_rules allow only a single caller to use these methods. If that's true, then the per-caller permissions in the config file would never apply to anyone else. Should the internal rules just allow any caller, since the actual checks happen via the config file?

spydaNVIDIA · 2026-04-01T21:03:37Z

The internal_rbac_rules allow both ForgeAdminCLI and CarbideDpsAgent access to these RPCs:

x.perm("RedfishBrowse", vec![ForgeAdminCLI, CarbideDpsAgent]);
x.perm("RedfishPost", vec![ForgeAdminCLI, CarbideDpsAgent]);
x.perm("RedfishPatch", vec![ForgeAdminCLI, CarbideDpsAgent]);

The allowlist exists to constrain automated service-to-service callers (like DPS) that should only touch specific BMC endpoints. External users (like ForgeAdminCLI) bypass the URI allowlist. I was thinking that, in the future, we may have other services that need proxy access to a different set of actions.
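
The two checks described in this thread can be sketched as one decision function. This is only an illustration of the described flow, not the PR's code: the `Policy`, `CallerKind`, and `Decision` types and the `authorize_post` function are hypothetical, and the URI comparison is a plain string match to keep the example short.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CallerKind {
    /// Automated service-to-service caller, for example the DPS agent.
    Service,
    /// Certificate-based external user, for example the admin CLI.
    External,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Allow,
    DeniedByRbac,
    DeniedByAllowlist,
}

struct Policy {
    /// RPC name -> principals that may call it (first-level check).
    rpc_callers: HashMap<&'static str, Vec<&'static str>>,
    /// Principal -> POST URIs it may target (second-level check).
    allowed_post_uris: HashMap<&'static str, Vec<&'static str>>,
}

fn authorize_post(policy: &Policy, principal: &str, kind: CallerKind, uri: &str) -> Decision {
    // First level: is this principal allowed to call the RPC at all?
    let may_call = policy
        .rpc_callers
        .get("RedfishPost")
        .map_or(false, |callers| callers.iter().any(|c| *c == principal));
    if !may_call {
        return Decision::DeniedByRbac;
    }
    // External users are gated by RBAC only and skip the URI allowlist.
    if kind == CallerKind::External {
        return Decision::Allow;
    }
    // Second level: service principals may only target allowlisted URIs.
    let uri_ok = policy
        .allowed_post_uris
        .get(principal)
        .map_or(false, |uris| uris.iter().any(|u| *u == uri));
    if uri_ok {
        Decision::Allow
    } else {
        Decision::DeniedByAllowlist
    }
}
```

Under these assumptions, a service principal with no allowlist entry passes the RBAC check but is denied every POST, which matches the "both checks are required" reading given earlier in the thread.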

yoks · 2026-04-01T21:43:40Z

I feel we should not let any other service (apart from NICo itself) mutate any resource under its control via a direct proxy.
This goes against consistent state management and FSM design.

spydaNVIDIA · 2026-04-01T23:21:04Z

Linked the design in a PM. @yoks please lmk if it looks OK after reviewing

poroh · 2026-04-02T14:56:57Z

With this design we will never be able to organically integrate external calls into the machine management infrastructure, because "what is allowed" is just a piece of configuration.

For example, somebody can decide to expose the power management interface to external services, and nothing prevents them from doing so. And Carbide doesn't have a single chance to handle it correctly without an additional level of patches.

This is the specific concern that I raised in the discussion of upstream project task #282.

zhaozhongn · 2026-04-02T17:27:27Z

That's correct. With the current implementation, software integrators (taking NICo and power provisioning software as components to build their solutions) will need to take responsibility for making sure their configs are sane. But fundamentally this is the same for other configs as well. And I tend to believe the currently exposed power ceiling setting interface should be "safe" from the state machine perspective.

Eventually, I hope we will be able to figure out a long-term solution to the general problem as you described.

yoks · 2026-04-02T17:54:59Z

I mean, if it is just a specific API subset, what stops us from exposing it in Carbide and using it? This would help us wire it into state management in the future.

zhaozhongn · 2026-04-02T20:54:24Z

I think we are doing that. Just via config not code?

poroh · 2026-04-02T21:24:13Z

I trust you that the settings you are going to expose are safe. This is not the matter of concern and also not a part of this PR. I'm specifically discussing the idea of directly exposing the Redfish interface through Carbide's gRPC API. I believe that once we provide this API, we cannot control who will use it or how. The code doesn't provide any guardrails around the parts of the Redfish tree that can interfere with Carbide's logic, and therefore anybody can write a cool tool that reboots a server when the state machine doesn't expect it.

This injects an ultimate source of inconsistency into machine state management.

spydaNVIDIA · 2026-04-02T21:30:13Z

Doesn't the config provide the guardrail that you are describing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) to issue the reboot POST via the config.

zhaozhongn · 2026-04-02T21:38:34Z

This is what I meant by using "config" rather than "code" to achieve the goal. This is OSS, so anyone can change the code as easily as the config in their deployment.

poroh · 2026-04-02T21:54:00Z

A software integrator may not know which endpoints are safe to enable and which are not. Conversely, we as Carbide developers don't know which endpoints will be exposed by software integrators. There are two issues with this:

For the integrator, it may work but randomly fail because of state machine interference.
For us, we don't know which invariants we are breaking by changing our code. Let's say somebody decides that exposing BIOS configuration is safe and uses it. After some time, we decide to change some BIOS parameters inside Carbide. And we have no clue which Carbide deployments out there we are breaking by this action.

zhaozhongn · 2026-04-02T22:02:48Z

I do not disagree with these points. However, this is OSS software we are talking about. There will be many things we cannot 100% assume, as people can change code and config equally easily. And from an integration point of view, they should need to make some changes. This is by no means safe.

poroh · 2026-04-02T22:03:15Z

This is not the best argument for exposing a raw API that breaks our claims about system state. To me it looks similar to exposing a raw SQL interface to the DB. It will be a killer feature for software integrators and a nightmare for long-term support.

zhaozhongn · 2026-04-02T22:06:58Z

I disagree. We exposed an API with a config that's safe. We support that. Any other usage is beyond our control and we will not "support" it. Could it be harder for the integrator? Yes, but that's the cost of these advanced features. The balancing and trade-off calculation will be up to them.

poroh · 2026-04-02T22:14:17Z

With the config, it is like raw SQL with a regexp validator of allowed expressions in the config. But I think that all arguments are on the table. Somebody will decide what direction Carbide should take.

poroh

See comments in conversations.

zhaozhongn · 2026-04-02T22:20:58Z

I agree with all the technical concerns. I think we might be able to solve all these elegantly in the future. But we could end up not being able to. Regardless, for now we need to provide this functionality for very practical reasons. And I do not see an obviously better alternative. Hence the decision to go this way.

poroh · 2026-04-02T22:31:22Z

The obvious alternative is to expose specific RPC command(s) to access the endpoints you need for Carbide DPS. It can be modelled as an enum:

enum RedfishCommand {
    UpdateNodeManagerDomains(...),
    NvidiaWorkloadProfileAction(...),
}

You can easily extend this command in the future but still keep full control over what is allowed to be done via the API.
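
One way to read this proposal is that callers send a typed command and the server alone decides which Redfish method and URI it maps to. The sketch below assumes that reading. The payload fields (`domain_id`, `gpu_index`, `enable`) and the `HttpMethod` type are hypothetical stand-ins for the elided `(...)` payloads, and the URIs are the ones from the allowlist example earlier in the thread.

```rust
/// Hypothetical typed commands; the payload fields are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
enum RedfishCommand {
    UpdateNodeManagerDomains { domain_id: u32 },
    NvidiaWorkloadProfileAction { gpu_index: u32, enable: bool },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HttpMethod {
    Post,
    Patch,
}

impl RedfishCommand {
    /// HTTP method the server would use for this command.
    fn method(&self) -> HttpMethod {
        match self {
            RedfishCommand::UpdateNodeManagerDomains { .. } => HttpMethod::Patch,
            RedfishCommand::NvidiaWorkloadProfileAction { .. } => HttpMethod::Post,
        }
    }

    /// Redfish URI built on the server side; callers never supply a path.
    fn uri(&self) -> String {
        match self {
            RedfishCommand::UpdateNodeManagerDomains { domain_id } => {
                format!("/redfish/v1/Managers/BMC/NodeManager/Domains/{domain_id}")
            }
            RedfishCommand::NvidiaWorkloadProfileAction { gpu_index, enable } => {
                let action = if *enable { "EnableProfiles" } else { "DisableProfiles" };
                format!(
                    "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{gpu_index}\
                     /Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.{action}"
                )
            }
        }
    }
}
```

Because the identifiers are integers and the paths are assembled inside `uri`, a caller cannot reach an endpoint that has no variant, and adding a new operation means adding a variant in code rather than a line in a config file.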

zhaozhongn · 2026-04-03T17:16:30Z

For that, surely. I have talked to @spydaNVIDIA and he will switch from using a string value to an enum value in the gRPC interface.

kensimon · 2026-04-06T17:35:08Z

There's a lot of discussion I haven't read all of, but I totally agree here. We shouldn't be exposing an arbitrary redfish proxy for POST/PUT/etc requests, but instead expressing the logical operations via our own API and letting callers call that directly (hiding redfish as an implementation detail.)

The PR as it is now places a ton of importance on creating a sufficiently secure config file with a redfish action allowlist, and I think the software ought to be more opinionated than that.

spydaNVIDIA · 2026-04-07T22:20:17Z

@poroh @kensimon I updated the MR to reflect the above discussion.

poroh · 2026-04-07T23:53:29Z

This change is not what I meant by changing the API. And with this change, it breaks abstractions because the enum contains specific URIs for H100/GB200/GB300. Carbide provides machines as an abstraction to users of its interface. So instead of giving users proxy access to BMC, Carbide should provide a high-level interface. In the case of DPS, it is:

Access to the latest power sensor readings (should be plumbed from the health service)
The ability to set up power profiles and power limits for CPU/GPU

It should not provide an interface for Redfish exploration of each individual machine.

In addition, the access footprint for sensor readings also matters. In the case of the "Redfish proxy" approach, it would require DB/Vault access for each individual sensor read.
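
As a rough illustration of the kind of high-level interface described here, the sketch below defines a machine-level trait with no Redfish paths in it. Everything in it is hypothetical: the `MachinePower` trait, its method names, the `PowerError` type, and the in-memory `FakePower` stand-in that replaces a real health service and BMC.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum PowerError {
    UnknownMachine,
    LimitOutOfRange,
}

/// Hypothetical high-level interface: callers name a machine, never a BMC URI.
trait MachinePower {
    /// Latest cached power draw in watts (for example, fed by a health service).
    fn latest_power_watts(&self, machine_id: &str) -> Result<u32, PowerError>;
    /// Requests a GPU power limit in watts for the given machine.
    fn set_gpu_power_limit(&mut self, machine_id: &str, watts: u32) -> Result<(), PowerError>;
}

/// In-memory stand-in so the sketch can run without any BMC.
struct FakePower {
    readings: HashMap<String, u32>,
    limits: HashMap<String, u32>,
    max_limit_watts: u32,
}

impl MachinePower for FakePower {
    fn latest_power_watts(&self, machine_id: &str) -> Result<u32, PowerError> {
        self.readings
            .get(machine_id)
            .copied()
            .ok_or(PowerError::UnknownMachine)
    }

    fn set_gpu_power_limit(&mut self, machine_id: &str, watts: u32) -> Result<(), PowerError> {
        if !self.readings.contains_key(machine_id) {
            return Err(PowerError::UnknownMachine);
        }
        // The server, not the caller, decides which limits are acceptable.
        if watts == 0 || watts > self.max_limit_watts {
            return Err(PowerError::LimitOutOfRange);
        }
        self.limits.insert(machine_id.to_string(), watts);
        Ok(())
    }
}
```

In this shape, reads are served from cached readings rather than a per-call BMC request, which is the access-footprint point made above, and validation of writes lives in server code where it can be coordinated with the state machine.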

zhaozhongn · 2026-04-08T00:30:59Z

I do not think we can or should provide such APIs in the way you suggested, which would require NICo to have a complete and precise understanding of the power provisioning details of every single HW model from the BMC (and NVUE for switches in the future). NICo is not good at this today and cannot be truly good at this tomorrow even if we wanted it to be. So, until someone works out a general and validated abstract power provisioning API that works for all current and future HW components, we should not stretch ourselves and should instead be honest and simply proxy the BMC REST (and NVUE for switches) power provisioning APIs.

What we are able to do today is know that these particular BMC (and NVUE) REST write endpoints will behave and will not conflict with NICo's other write operations, and use an enum to list and guard them. I believe this PR does this.

See my comments

spydaNVIDIA self-assigned this Apr 1, 2026

spydaNVIDIA requested a review from a team as a code owner April 1, 2026 17:25

spydaNVIDIA requested review from aswaroop-nv and ddejong-spec April 1, 2026 17:26

spydaNVIDIA force-pushed the pyda_dps branch from 241bc5e to 81dc964 April 1, 2026 17:27

Matthias247 reviewed Apr 1, 2026

crates/rpc/proto/forge.proto (outdated, resolved)

Matthias247 reviewed Apr 1, 2026

crates/api/src/auth/internal_rbac_rules.rs (outdated, resolved)

crates/api/src/cfg/file.rs (resolved)

crates/api/casbin-policy.csv (outdated, resolved)

spydaNVIDIA force-pushed the pyda_dps branch from 81dc964 to 9ca83cf April 1, 2026 23:18

spydaNVIDIA force-pushed the pyda_dps branch 2 times, most recently from 47878e1 to 0b50fd2 April 2, 2026 05:43

spydaNVIDIA requested a review from Matthias247 April 2, 2026 18:37

zhaozhongn approved these changes Apr 2, 2026

poroh previously requested changes Apr 2, 2026

spydaNVIDIA force-pushed the pyda_dps branch from 0b50fd2 to 9a34c2b April 2, 2026 22:18

spydaNVIDIA force-pushed the pyda_dps branch 2 times, most recently from af4340c to c06a12a April 7, 2026 22:18

spydaNVIDIA requested a review from poroh April 7, 2026 22:20

spydaNVIDIA added 6 commits April 7, 2026 17:54

320d21d feat: add redfish POST and PATCH proxy for the DPS agent to access. also allow the agent to access the GET proxy

66e102c chore: update name from carbide-dps to carbide-dps-agent

0cf483e fix: address MR feedback

b9eafcb chore: re-use the redfish_proxy_get logic in the implementation of RedfishBrowse

bb0ba55 fix: addres MR feedback and have the URI specified as an enum to the RedfishProxy rather than as a raw string

7dd27a6 fix: address fmt issues

spydaNVIDIA force-pushed the pyda_dps branch from c06a12a to 7dd27a6 April 8, 2026 00:54

poroh removed their request for review April 8, 2026 00:55
