GPU Accelerator Mgmt Interfaces Spec v1.1 #52

Open
jcleung5549 wants to merge 2 commits into main from spec

@jcleung5549

Contributor

Hari's initial draft for review

Signed-off-by: John Leung <john.leung@intel.com>
Comment thread docs/GPU_Accelerator_Management_Interfaces_Specification_v1.1.md Outdated
The initial, static discovery of devices seeks to obtain primarily immutable information about the discrete accelerator device. This is done by reading the values from a dedicated FRU component (e.g. EEPROM) contained within the device.

#### 2.1.2 Transport Protocol
MCTP is the media independent protocol for communication among management controllers within the system. The corresponding section below describes the Bus owner relationship, the endpoint discovery, EID assignment, the bridges, and the EID pool assignments and corresponding flows.

Collaborator

this is incomplete

Contributor Author

@njshah2 - what additional information do you want to see?

MCTP is the media independent protocol for communication among management controllers within the system. The corresponding section below describes the Bus owner relationship, the endpoint discovery, EID assignment, the bridges, and the EID pool assignments and corresponding flows.

#### 2.1.3 Attestation
Once static and MCTP endpoint discovery is complete, an important step is attestation of the device. SPDM (DSP0274) operates over MCTP as a standardized mechanism to authenticate hardware identity and measure firmware identity.

Collaborator

why does the EP have to be static and what is static?

Standards-based device firmware management requires support for PLDM Type 5 (DSP0267). DSP0267 provides a standardized way to query the version of the device's firmware (inventory) and to push firmware to the device (update).

### 2.2 UBB Accelerator Devices Manageability
Universal baseboard (UBB) designs are multi-GPU/accelerator devices with high-speed intra-GPU interconnects, PCI-e switches and retimers, and complex topologies. In those HW devices the expectation is the UBB must present a Redfish interface and may additionally support MCTP for high-frequency telemetry (e.g. thermal loop data) and attestation.

Collaborator

Is there any data to suggest that the additional MCTP interface is more performant than Redfish? Why duplicate?

Collaborator

We should not recommend both until there is data to show that.


| Telemetry Data/Attribute (unordered) | Protocol | Use Case | Frequency |
|---|---|---|---|
| Monitoring (Sensors) | PLDM Type 2 (DSP0248) and MCTP OCP OEM | e.g. Thermal, power, counters… | 1s |

Collaborator

Giving examples is not very helpful. If we are defining KPIs, then they need to be exact or they are useless.

Collaborator

Rephrase to: "Thermal and power should be polled every 1 second."

| Monitoring (Sensors) | PLDM Type 2 (DSP0248) and MCTP OCP OEM | e.g. Thermal, power, counters… | 1s |
| Discovery | MCTP | | On demand |
| Support Dump | Future direction: PLDM Type DSP0242 | Comprehensive triage | On demand |
| Debug Logs | Future direction: PLDM Type (DSP0242) | | On demand |

Collaborator

How are logs supported using DSP0242?

Collaborator

It probably needs to be defined how multiple logs are formatted in the file, or else this is useless.

#### 3.2.2 Virtual FRU requirements
If a virtual FRU device is implemented, it **SHALL**:
- Be available for access on AUX power, and
- Be accompanied by a recovery design in the event that the virtual FRU fails.

Collaborator

I am not sure I understand what this means.

- Multi Record Area: Mandatory for any OEM extensions and the last area so that it extends.

#### 3.2.4 FRU access ownership
It is mandatory that a FRU data source is only accessed by a single agent (e.g. BMC on Host Platform). Implementations **SHALL NOT** allow multiple agents to access a FRU data source concurrently.

Collaborator

Seems like a vendor-specific issue. Why can't multiple entities access the FRU?

### 3.3 MCTP
This section describes the MCTP discovery and inventory flows and lists minimum required commands. This section also proposes and defines a high-level generic VDM command format for data that needs OEM extensions.

#### 3.3.1 Bus Owner and EIDs

Collaborator

This whole section is redundant; this is how the MCTP base protocol works, unless there are disjoint MCTP networks, which the spec should not prevent. I think all this needs to say is: support dynamic EID allocation from the bus owner (BO).


Agreed; I don't think duplicating MCTP spec definitions is needed.

The “topmost bus owner” **SHALL** allocate all EIDs. When bridge devices are present, the topmost bus owner **SHALL** assign an EID pool to bridge devices, and bridge devices **SHALL** allocate EIDs to devices behind them from their assigned pool.
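
The allocation rule in this hunk can be illustrated with a minimal sketch. It assumes the assignable EID range from the MCTP base specification (8 through 254, with 0 as the null EID, 1 to 7 reserved, and 0xFF as broadcast). The `EidPool` class and its `allocate`/`carve` methods are hypothetical names used only for illustration; they are not part of the draft or of any MCTP library.

```python
from dataclasses import dataclass, field

# Assignable EID range per the MCTP base specification:
# 0 = null EID, 1-7 reserved, 0xFF = broadcast.
FIRST_EID, LAST_EID = 8, 254


@dataclass
class EidPool:
    """Hypothetical helper: a contiguous EID range handed out in order."""
    start: int
    size: int
    _next: int = field(init=False)

    def __post_init__(self):
        if self.start < FIRST_EID or self.start + self.size - 1 > LAST_EID:
            raise ValueError("pool outside the assignable EID range")
        self._next = self.start

    def allocate(self) -> int:
        """Hand out the next free EID in this pool."""
        if self._next >= self.start + self.size:
            raise RuntimeError("EID pool exhausted")
        eid = self._next
        self._next += 1
        return eid

    def carve(self, size: int) -> "EidPool":
        """Reserve `size` consecutive EIDs as a pool for a bridge."""
        if self._next + size > self.start + self.size:
            raise RuntimeError("not enough EIDs left for a bridge pool")
        pool = EidPool(self._next, size)
        self._next += size
        return pool


# The topmost bus owner holds the whole assignable range.
bus_owner = EidPool(FIRST_EID, LAST_EID - FIRST_EID + 1)
bridge_eid = bus_owner.allocate()   # EID of the bridge itself
bridge_pool = bus_owner.carve(4)    # pool assigned to the bridge
# The bridge allocates EIDs to the devices behind it from its own pool.
downstream = [bridge_pool.allocate() for _ in range(2)]
```

The sketch only models bookkeeping; the actual assignment would be carried by MCTP control commands, which are not shown here.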

#### 3.3.2 MCTP Discovery
MCTP discovery/re-discovery **SHALL** occur on the following conditions:

Collaborator

I am not sure this is required. Discovery notify command can be used to rediscover any EP even if just the MCTP logic resets.

**Figure 1 – MCTP Discovery Flow (non-normative):** The source includes a flow chart showing the sequence of MCTP control commands used during discovery.

#### 3.3.3 MCTP command details (normative)
Implementations **SHALL** implement MCTP control protocol command support consistent with Table 5.

Collaborator

I don't think all these commands should be normative, especially things like Query Rate Limit.

| Temp | Numeric | • | | • | • | |
| Power | Numeric | • | | • | | |
| Total power | Numeric | • | | | | |
| VRs | Numeric | • | | • | | |

Collaborator

what is a VR sensor?

| Power | Numeric | • | | • | | |
| Total power | Numeric | • | | | | |
| VRs | Numeric | • | | • | | |
| Composite Health | State | • | | | | |

Collaborator

why is health composite?

Reset-to-Defaults requirement:
- The Reset to Defaults state effecter **SHALL** use OEM State Set 40000 and **SHALL** support enumerated value “1” indicating “Reset to Default”. No other values are required.
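
As a rough illustration of what this requirement implies on the wire, the sketch below builds the request payload of a PLDM Type 2 SetStateEffecterStates command for a single (non-composite) effecter. It assumes the field layout from DSP0248 (little-endian `effecterID`, `compositeEffecterCount`, then one `setRequest`/`effecterState` pair) and assumes that the state set ID 40000 is carried in the effecter's PDR rather than in the command itself. The effecter ID `0x0010` and the function name are hypothetical.

```python
import struct

OEM_STATE_SET_ID = 40000   # from the requirement above; lives in the PDR (assumption)
RESET_TO_DEFAULT = 1       # the only enumerated value the draft requires
REQUEST_SET = 1            # setRequest value meaning "apply effecterState"


def build_reset_to_defaults_request(effecter_id: int) -> bytes:
    """Build a SetStateEffecterStates payload for one state field.

    Layout (assumed from DSP0248): effecterID (uint16, little-endian),
    compositeEffecterCount (uint8), setRequest (uint8), effecterState (uint8).
    """
    if not 0 <= effecter_id <= 0xFFFF:
        raise ValueError("effecter ID must fit in 16 bits")
    return struct.pack("<HBBB", effecter_id, 1, REQUEST_SET, RESET_TO_DEFAULT)


# Hypothetical effecter ID, chosen only for the example.
payload = build_reset_to_defaults_request(0x0010)
```

The PLDM message header and the MCTP framing around this payload are omitted.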

#### 3.5.3 Entity Association

Collaborator

This is redundant; it is already mandatory in the spec.

| Manageability Objective | Technology | Standard |
|---|---|---|
| Static discovery | IPMI FRU | Platform Management FRU Information Storage Definition (rev. 1.3) |
| Transport protocol | MCTP | DMTF DSP0236 revision 1.3.1 or later |

Collaborator

Suggest 1.3.3 or later (or 1.4.0 if it's out in time).

---

## Executive Summary (retained)
Management of GPUs is not standardized, resulting in significant effort and time to onboard each new HW design. The lack of standardization is also a burden on suppliers who must accommodate varying requirements from their customers. This document describes industry standard formats and protocols that make it easier for CSPs (Cloud Service Providers, hyperscalers) to onboard new GPU and accelerator designs with less toil and faster time to market while reducing manageability permutations for suppliers.

@austin42, Apr 24, 2026

Collaborator

In addition to CSPs, I would say servers in general. Besides our CSP use cases, this standardization is beneficial for servers destined for enterprise datacenters or even edge deployments.

Contributor Author

Add "... and traditional datacenters"

Static discovery of the discrete accelerator device is accomplished by accessing an I2C/I3C FRU EEPROM or, if not present, through virtual FRU devices.

#### 3.2.1 FRU data source
A discrete accelerator device **SHALL** expose a FRU data source using one of the following mechanisms:


What's the driving motivator for dictating an I2C/I3C EEPROM or a virtual device? If external entities are querying FRU info via the discrete accelerator itself, does the lower interface even matter?

- Be accompanied by a recovery design in the event that the virtual FRU fails.

#### 3.2.3 FRU format, capacity, and areas
The FRU device **SHALL** return IPMI 1.3 FRU formatted data.
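
For readers less familiar with the FRU layout, here is a minimal sketch of how a consumer might validate the 8-byte common header of FRU data, whether it comes from an EEPROM or a virtual FRU. It assumes the header layout from the Platform Management FRU Information Storage Definition: a format version byte, five area offsets in multiples of 8 bytes, a pad byte, and a zero checksum. The function name and the sample bytes are hypothetical.

```python
import struct

AREA_NAMES = ("internal_use", "chassis", "board", "product", "multirecord")


def parse_fru_common_header(data: bytes) -> dict:
    """Return byte offsets of the FRU areas present, after basic validation."""
    if len(data) < 8:
        raise ValueError("FRU data shorter than the 8-byte common header")
    header = data[:8]
    # Zero checksum: all eight header bytes must sum to 0 modulo 256.
    if sum(header) % 256 != 0:
        raise ValueError("common header checksum mismatch")
    version, *offsets, _pad, _checksum = struct.unpack("8B", header)
    # The low nibble carries the format version; 1 is the expected value.
    if version & 0x0F != 0x01:
        raise ValueError("unsupported FRU format version")
    # Offsets are stored in multiples of 8 bytes; 0 means the area is absent.
    return {name: off * 8 for name, off in zip(AREA_NAMES, offsets) if off}


# Hypothetical header: board area at 8, product at 40, multirecord at 80.
sample = bytes([0x01, 0x00, 0x00, 0x01, 0x05, 0x0A, 0x00, 0xEF])
areas = parse_fru_common_header(sample)
```

A real reader would go on to parse each area and check its own checksum; that is left out to keep the sketch short.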


Is there a reason DSP0220 was not considered as a format? Should the spec allow for either format?

**Figure 12 – UBB Redfish Model (non-normative):** The resource label “UBB” represents the complete Accelerator subsystem/enclosure within the server and “OAM_X” represents each accelerator module assembly.

Normative requirements:
- UBB accelerator enclosure **SHALL** be modeled as a single logical subsystem.


Does the requirement to model the UBB enclosure as a "single logical subsystem" stop at the subsystem's BMC, or does this requirement carry forward to the northbound Redfish interface from the system? If it carries forward to the northbound Redfish interface from the system, then this breaks existing Redfish modeling patterns for the ComputerSystem resource.

Comment on lines +492 to +493
- Each accelerator module **SHALL** be represented as a distinct Redfish resource.
- Relationships **SHALL** be modeled using Redfish links.


Do the other requirements in this section even matter with the usage of the Redfish profile?

- Each accelerator module **SHALL** be represented as a distinct Redfish resource.
- Relationships **SHALL** be modeled using Redfish links.

#### 4.1.2 Location Objects


Shouldn't this be covered in the Redfish profile?

- When `RawBitStream` is not requested, responder **SHALL** use standard attributes such as `MeasurementType` and `MeasurementHashAlgorithm` and provide digest data.

Action URIs (retained):
- `/redfish/v1/ComponentIntegrity/UBB/Actions/OCPGPUMgmtWG.ComponentIntegrity.v1_0_0.SPDMGetRawBitStream`


If we really (and I mean really) want to standardize on non-standard actions/properties... We at least need to fix the OEM definitions. The action URI "/redfish/v1/ComponentIntegrity/UBB/Actions/OCPGPUMgmtWG.ComponentIntegrity.v1_0_0.SPDMGetRawBitStream" is not valid.

Signed-off-by: John Leung <john.leung@intel.com>

4 participants