feat: add synchronous Redfish POST/PATCH proxy for the Carbide-DPS agent #771
spydaNVIDIA wants to merge 6 commits into NVIDIA:main from
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-04-01 17:27:53 UTC | Commit: 241bc5e |
Matthias247
left a comment
I don't fully understand the distinction between the first-level check (carbide-dps-agent) and what goes into the config file. Does a caller need to have both roles: first carbide-dps-agent to get access, and then whatever is in the config file? Or can carbide-dps-agent access everything, while principals in the config file can additionally access specific paths? If it's the latter, then I'm wondering if we can just go with one of the two access patterns, so that things are easier to understand.
The former. We need the first-level check to enable the DPS agent to talk to the specific gRPC endpoints (RedfishPost etc.). Although it is OK for the agent to get full access to the GET proxy, we do not want the DPS agent to have admin-level access to all POST/PATCH actions against BMCs managed by Carbide. So we introduce the config file to define the POST and PATCH methods that the DPS agent is allowed to hit. Here is an example from forged:
Thanks for the explanation! Totally agree we want to limit the callable methods. The only thing I find surprising with this setup is that the internal_rbac_rules only allow a single caller to invoke the methods. If that's true, then the per-caller permissions in the config file would not apply. Should the internal rules just allow any caller, since the actual checks happen via the config file?
The internal_rbac_rules allow both ForgeAdminCli and CarbideDpsAgent access to these RPCs:
The allowlist exists to constrain automated service-to-service callers (like DPS) that should only touch specific BMC endpoints. External users (like ForgeAdminCLI) bypass the URI allowlist. I was thinking that in the future we may have other services that need proxy access to a different set of actions.
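The two-level check discussed in this thread can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Carbide's actual code: the shape of the rule tables, the `authorize` function, the `is_external_user` flag, and the example URIs are all hypothetical. Only the principal names and RPC names come from the discussion.

```python
# Layer 1: which principals may call which RPCs at all
# (a stand-in for internal_rbac_rules; the table shape is an assumption).
RBAC_RULES = {
    "RedfishBrowse": {"ForgeAdminCli", "CarbideDpsAgent"},
    "RedfishPost": {"ForgeAdminCli", "CarbideDpsAgent"},
    "RedfishPatch": {"ForgeAdminCli", "CarbideDpsAgent"},
}

# Layer 2: per-principal URI allowlist for mutating calls
# (hypothetical config shape and hypothetical URI).
URI_ALLOWLIST = {
    "CarbideDpsAgent": {
        "RedfishPatch": {"/redfish/v1/Chassis/1/EnvironmentMetrics"},
        "RedfishPost": set(),
    },
}


def authorize(principal, rpc, uri, is_external_user=False):
    """Return True only if both layers permit the call."""
    # Layer 1: the principal must be allowed to reach the RPC.
    if principal not in RBAC_RULES.get(rpc, set()):
        return False
    # The GET proxy is not restricted by the URI allowlist.
    if rpc == "RedfishBrowse":
        return True
    # External (certificate-based) users bypass the allowlist.
    if is_external_user:
        return True
    # Layer 2: service principals may only mutate allowlisted URIs.
    allowed = URI_ALLOWLIST.get(principal, {}).get(rpc, set())
    return uri in allowed
```

Under these assumptions, a reset POST from the DPS agent is denied unless an integrator explicitly adds that URI to the agent's allowlist, while the same call from an external admin user passes on the RBAC check alone.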
I feel we should not let any other service (apart from NICo itself) do any mutation of any resource under its control via a direct proxy.
Linked the design in a PM. @yoks please lmk if it looks OK after reviewing.
47878e1 to 0b50fd2
With this design we will never be able to organically integrate external calls into the machine management infrastructure, because "what is allowed" is just a piece of configuration. For example, somebody can decide to expose the power management interface to external services, and nothing prevents them from doing so. Carbide doesn't have a single chance to handle it correctly without an additional level of patches. This is the specific concern that I raised in the discussion of upstream project task #282.
That's correct. With the current implementation, software integrators (taking NICo and power provisioning software as components to build their solutions) will need to take responsibility for making sure their configs are sane. But fundamentally this is the same for other configs as well. And I tend to believe the currently exposed power ceiling setting interface should be "safe" from the state machine perspective. Eventually, I hope we will be able to figure out a long-term solution to the general problem you described.
I mean, if it is just a specific API subset, what stops us from exposing those operations in Carbide and using them? This would help us wire them in via state management in the future.
I think we are doing that, just via config rather than code?
I trust you that the settings you are going to expose are safe. This is not the matter of concern and is also not part of this PR. I'm specifically discussing the idea of directly exposing the Redfish interface through Carbide's gRPC API. I believe that once we provide this API, we cannot control who will use it and how. The code doesn't provide any guardrails around the parts of the Redfish tree that can interfere with Carbide's logic, so anybody can write a cool tool that reboots a server when the state machine doesn't expect it. This injects the ultimate source of inconsistency into machine state management.
Doesn't the config provide the guardrail you are describing here? To reboot the server, the software integrator would have had to explicitly enable additional services (outside of NICo) in the config to issue the POST that reboots servers.
This is what I meant by using "config" rather than "code" to achieve the goal. This is OSS, so anyone can change the code as easily as the config in their deployment.
A software integrator may not know which endpoints are safe to enable and which are not. Conversely, we as Carbide developers don't know which endpoints will be exposed by software integrators. And there are two issues with this:
I do not disagree with these points. However, this is OSS software we are talking about. There will be many things we cannot assume 100%, as people can change code and config equally easily. And from an integration point of view, they will need to make some changes. This is by no means safe.
This is not the best argument for exposing a raw API that breaks our claims about system state. To me it looks similar to exposing a raw SQL interface to the DB. It will be a killer feature for software integrators and a nightmare for long-term support.
I disagree. We exposed an API with a config that's safe. We support that. Any other usage is beyond our control and we will not "support" it. Could it be harder for the integrator? Yes, but that's the cost of these advanced features. The balancing and trade-off calculation will be up to them.
With the config, it is like raw SQL with a regexp validator of allowed expressions in the config. But I think all the arguments are on the table. Somebody will decide what direction Carbide should take.
poroh
left a comment
See comments in conversations.
I agree with all the technical concerns. I think we might be able to solve all of these elegantly in the future, but we could end up not being able to. Regardless, for now we need to provide this functionality for very practical reasons, and I do not see an obviously better alternative. Hence the decision to go this way.
The obvious alternative is to expose specific RPC command(s) for the endpoints you need for Carbide DPS. It can be modelled as an enum: enum RedfishCommand { You can easily extend this command set in the future but still keep full control over what is allowed to be done via the API.
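One way to picture the enum-based alternative is the hedged Python sketch below. The member name, the HTTP method, the URI, and the `resolve` function are hypothetical and are not taken from the PR. The point is only that callers pick a named command and never supply a URI themselves, so the set of reachable endpoints lives in code.

```python
from enum import Enum


class RedfishCommand(Enum):
    # Hypothetical member; the real set of commands is not shown in the thread.
    SET_POWER_LIMIT = "set_power_limit"


# Server-side mapping from command to HTTP method and Redfish URI.
# The URI is an illustrative assumption, not a path used by Carbide.
_COMMAND_TABLE = {
    RedfishCommand.SET_POWER_LIMIT: (
        "PATCH",
        "/redfish/v1/Chassis/1/EnvironmentMetrics",
    ),
}


def resolve(command):
    """Translate a named command into (method, uri); reject anything else."""
    if not isinstance(command, RedfishCommand):
        raise ValueError(f"unsupported command: {command!r}")
    return _COMMAND_TABLE[command]
```

Adding a new operation then means adding an enum member and a table entry in code review, rather than editing an allowlist in a deployment's config file.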
For that, surely. I have talked to @spydaNVIDIA and he will switch from using a string value to an enum value in the gRPC interface.
There's a lot of discussion I haven't read all of, but I totally agree here. We shouldn't be exposing an arbitrary Redfish proxy for POST/PUT/etc. requests, but instead expressing the logical operations via our own API and letting callers call that directly (hiding Redfish as an implementation detail). The PR as it is now places a ton of importance on creating a sufficiently secure config file with a Redfish action allowlist, and I think the software ought to be more opinionated than that.
af4340c to c06a12a
This change is not what I meant by changing the API. It also breaks abstractions, because the enum contains specific URIs for H100/GB200/GB300. Carbide provides machines as an abstraction to users of its interface. So instead of giving users proxy access to the BMC, Carbide should provide a high-level interface. In the case of DPS, it is:
It should not provide an interface for Redfish exploration of each individual machine. In addition, the access footprint for sensor readings also matters. With the "Redfish proxy" approach, each individual sensor read would require DB/Vault access.
I do not think we can or should provide such APIs in the way you suggested, which requires NICo to have a complete and precise understanding of the power provisioning details of every single HW model from the BMC (and NVUE for switches in the future). NICo is not good at this today and cannot be truly good at this tomorrow even if we wanted it to be. So, before someone can work out a general and validated abstract power provisioning API that works for all current and future HW components, we should not stretch ourselves and should instead be honest and simply proxy the REST BMC (and NVUE for switches) power provisioning APIs. What we are able to do today is to know that these particular BMC (and NVUE) REST write endpoints will behave and not conflict with NICo's other write operations, and to use an enum to list and guard them. I believe this PR does this.
…lso allow the agent to access the GET proxy
…RedfishProxy rather than as a raw string
Description
The Carbide-DPS agent must authenticate via mTLS using a SPIFFE X.509 certificate with the service identity carbide-dps-agent. Carbide authorizes this identity through both internal RBAC rules and Casbin policy to call RedfishBrowse, RedfishPost, and RedfishPatch.
Allow the Carbide-DPS agent to perform Redfish GET, POST, and PATCH operations on BMCs through Carbide's gRPC API.
POST and PATCH requests are subject to per-principal URI allowlists configured in carbide-api-config.toml, restricting which Redfish endpoints a given service principal can target. External (certificate-based) users bypass the allowlist since the RBAC layer already gates their access.
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes