Some INNOLIGHT 800G QSFP-DDs get stuck in CMIS DP_INIT intermittently #21603

@dgodwin-nokia

Description

Some INNOLIGHT 800G QSFP-DDs intermittently get stuck in the CMIS DP_INIT state. Once DP_INIT times out, the transceiver stays DOWN permanently. This happens about half the time on these QSFP-DDs. The transceivers appear to have a problem with the current implementation of decommission_all_datapaths() in cmis.py.

decommission_all_datapaths() is called when the transceiver's current application does not match the target application. Currently, it issues DEINIT, sets the app ID to 0 (unused), and issues INIT, all in one function and without waiting for any step to succeed. I am not sure whether this is acceptable per the CMIS specification, but these transceivers do not behave well in this case:

```python
def decommission_all_datapaths(self):
    '''
    Return True if all datapaths are successfully de-commissioned, False otherwise
    '''
    # De-init all datapaths
    self.set_datapath_deinit((1 << self.NUM_CHANNELS) - 1)
    # Decommission all lanes by applying AppSel=0
    self.set_application(((1 << self.NUM_CHANNELS) - 1), 0, 0)
    # Start with AppSel=0 i.e. undo any default AppSel
    self.scs_apply_datapath_init((1 << self.NUM_CHANNELS) - 1)

    dp_state = self.get_datapath_state()
    config_state = self.get_config_datapath_hostlane_status()

    for lane in range(self.NUM_CHANNELS):
        name = "DP{}State".format(lane + 1)
        if dp_state[name] != 'DataPathDeactivated':
            return False

        name = "ConfigStatusLane{}".format(lane + 1)
        if config_state[name] != 'ConfigSuccess':
            return False

    return True
```

These INNOLIGHT 800G QSFP-DDs always complete INIT and come up if the function above is modified to add a delay after DEINIT and after INIT, so that each step is given time to succeed. However, since this function is called inline in the CMIS state machine in xcvrd, a blocking wait should not be added here.
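One possible way to get the waiting without blocking xcvrd is to split the decommission into phases and let the state machine advance one phase per poll. The following is only a sketch of that idea, not a proposed patch: `DecommissionStepper`, `step()`, its return strings, and the per-phase timeout are hypothetical names and values, and the choice to wait for `DataPathDeactivated` before applying AppSel=0 is an assumption rather than something confirmed against the CMIS specification. The `api` object is assumed to expose the same five calls used in the function above.

```python
import time


class DecommissionStepper:
    """Hypothetical helper: DEINIT -> wait -> AppSel=0 + INIT -> wait, one phase per poll."""

    def __init__(self, api, timeout_s=5.0, clock=time.monotonic):
        self.api = api              # object exposing the cmis.py calls used below
        self.timeout_s = timeout_s  # assumed per-phase budget, not a CMIS-defined value
        self.clock = clock
        self.mask = (1 << api.NUM_CHANNELS) - 1
        self.phase = "start"
        self.deadline = None

    def _all_lanes(self, table, fmt, wanted):
        # True only if every lane reports the wanted value.
        table = table or {}
        return all(table.get(fmt.format(lane + 1)) == wanted
                   for lane in range(self.api.NUM_CHANNELS))

    def _deactivated(self):
        return self._all_lanes(self.api.get_datapath_state(),
                               "DP{}State", "DataPathDeactivated")

    def _configured(self):
        return self._all_lanes(self.api.get_config_datapath_hostlane_status(),
                               "ConfigStatusLane{}", "ConfigSuccess")

    def step(self):
        """Advance at most one phase; returns 'pending', 'done' or 'timeout'."""
        now = self.clock()
        if self.phase == "start":
            self.api.set_datapath_deinit(self.mask)
            self.phase = "wait_deinit"
            self.deadline = now + self.timeout_s
            return "pending"
        if self.phase == "wait_deinit":
            if self._deactivated():
                # DEINIT has taken effect: apply AppSel=0 and trigger INIT.
                self.api.set_application(self.mask, 0, 0)
                self.api.scs_apply_datapath_init(self.mask)
                self.phase = "wait_config"
                self.deadline = now + self.timeout_s
                return "pending"
        elif self.phase == "wait_config":
            if self._deactivated() and self._configured():
                self.phase = "done"
        if self.phase == "done":
            return "done"
        return "timeout" if now > self.deadline else "pending"
```

With something like this, the state machine would call `step()` once per iteration for the port and move on while the result is `'pending'`, so no sleep is needed inside the CMIS loop.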
 
An example log from pmon when the issue occurs:

2024 Dec 7 03:50:57.965147 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: 800G, lanemask=0xff, CMIS state=INSERTED, Module state=ModuleReady, DP state={'DP1State': 'DataPathDeactivated', 'DP2State': 'DataPathDeactivated', 'DP3State': 'DataPathDeactivated', 'DP4State': 'DataPathDeactivated', 'DP5State': 'DataPathDeactivated', 'DP6State': 'DataPathDeactivated', 'DP7State': 'DataPathDeactivated', 'DP8State': 'DataPathDeactivated'}, appl 3 host_lane_count 8 retries=3
2024 Dec 7 03:50:57.976291 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: Setting appl=3
2024 Dec 7 03:50:57.986922 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: Setting host_lanemask=0xff
2024 Dec 7 03:50:58.008685 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: Setting media_lanemask=0xff
2024 Dec 7 03:50:58.062208 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: force Datapath reinit
2024 Dec 7 03:50:59.102274 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: 800G, lanemask=0xff, CMIS state=DP_DEINIT, Module state=ModuleReady, DP state={'DP1State': 'DataPathDeactivated', 'DP2State': 'DataPathDeactivated', 'DP3State': 'DataPathDeactivated', 'DP4State': 'DataPathDeactivated', 'DP5State': 'DataPathDeactivated', 'DP6State': 'DataPathDeactivated', 'DP7State': 'DataPathDeactivated', 'DP8State': 'DataPathDeactivated'}, appl 3 host_lane_count 8 retries=3
2024 Dec 7 03:50:59.126514 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: DpDeinit duration 1.0 secs, modulePwrUp duration 10.0 secs
2024 Dec 7 03:51:00.166925 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: 800G, lanemask=0xff, CMIS state=AP_CONFIGURED, Module state=ModuleReady, DP state={'DP1State': 'DataPathDeactivated', 'DP2State': 'DataPathDeactivated', 'DP3State': 'DataPathDeactivated', 'DP4State': 'DataPathDeactivated', 'DP5State': 'DataPathDeactivated', 'DP6State': 'DataPathDeactivated', 'DP7State': 'DataPathDeactivated', 'DP8State': 'DataPathDeactivated'}, appl 3 host_lane_count 8 retries=3
...
...
...
2024 Dec 7 03:51:09.654936 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: 800G, lanemask=0xff, CMIS state=DP_INIT, Module state=ModuleReady, DP state={'DP1State': 'DataPathDeactivated', 'DP2State': 'DataPathDeactivated', 'DP3State': 'DataPathDeactivated', 'DP4State': 'DataPathDeactivated', 'DP5State': 'DataPathDeactivated', 'DP6State': 'DataPathDeactivated', 'DP7State': 'DataPathDeactivated', 'DP8State': 'DataPathDeactivated'}, appl 3 host_lane_count 8 retries=3
2024 Dec 7 03:51:09.657384 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: timeout for 'ConfigSuccess'
2024 Dec 7 03:51:10.687747 uber NOTICE pmon#xcvrd[34]: CMIS: Ethernet184: 800G, lanemask=0xff, CMIS state=INSERTED, Module state=ModuleReady, DP state={'DP1State': 'DataPathDeactivated', 'DP2State': 'DataPathDeactivated', 'DP3State': 'DataPathDeactivated', 'DP4State': 'DataPathDeactivated', 'DP5State': 'DataPathDeactivated', 'DP6State': 'DataPathDeactivated', 'DP7State': 'DataPathDeactivated', 'DP8State': 'DataPathDeactivated'}, appl 3 host_lane_count 8 retries=4
2024 Dec 7 03:51:10.687747 uber ERR pmon#xcvrd[34]: CMIS: Ethernet184: FAILED
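To make the sequence in a log like the one above easier to follow, the per-port CMIS state and retry counter can be pulled out of the status lines. This is a small triage sketch that assumes the line format shown in this log; `cmis_transitions` and `STATE_RE` are hypothetical names and not part of xcvrd.

```python
import re

# Matches the status lines of the form
# "CMIS: <port>: ..., CMIS state=<STATE>, ... retries=<N>" seen in the log above.
STATE_RE = re.compile(
    r"CMIS: (?P<port>\S+): .*?CMIS state=(?P<state>\w+),.*?retries=(?P<retries>\d+)"
)


def cmis_transitions(lines):
    """Return (port, cmis_state, retries) for every status line, in log order."""
    found = []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            found.append((match.group("port"),
                          match.group("state"),
                          int(match.group("retries"))))
    return found
```

Lines that do not carry a `CMIS state=` field, such as the `Setting appl=3` or `FAILED` lines, are skipped.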

 
Observed this behavior on the following PNs:

Vendor PN: T-DP8CNH-NNO
Vendor PN: T-DP8CNT-NNO

Labels: Triaged
