Add CMIS FSM for DP decomission#608
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| # Set Explicit control bit to apply Custom Host SI settings | ||
| ec = 1 | ||
| # Skip rest if it's in decommission state machine | ||
| if not self.is_decomm_lead_lport(lport): |
There was a problem hiding this comment.
@AnoopKamath Fact that the non lead port wont reach here , is it more appropriate to use is_decomm_pending()?
There was a problem hiding this comment.
I'm ok with either is_decomm_pending() or is_decomm_lead_lport(), they do the same job here,
updated to is_decomm_pending()
|
|
||
| # Set the decommission pending flag to False and invoke CMIS reinit | ||
| # so that normal CMIS initialization can begin | ||
| if self.is_decomm_lead_lport(lport): |
There was a problem hiding this comment.
@AnoopKamath same comment. use is_decomm_pending() ?
There was a problem hiding this comment.
updated to is_decomm_pending
|
Today, We tried these changes with 202505 with 2x400G static breakout (800G => 2x400G) and hit the issue where lead port is getting stuck in CMIS_STATE_DP_INIT state, due to not getting ConfigSuccess, leading to timeout and eventually moving to FAILED state. While non-lead port gets stuck at INSERTED waiting for lead port to finish and eventually move to FAILED state. Though same port works fine without these changes. Lead port Logs: Non-lead Port Logs: |
@Keshavg-marvell :Could you please confirm if the decommissioning process is completed using the previous version of code? After executing the code below, you can set the module to CMIS_READY and refer to pages 10 and 11 to verify if the AppSel value was successfully set to 0. The new version of code performs the same function but with the appropriate delays. One subport is responsible for taking this action. Once you confirm that the decommissioning function works without errors in the upstream version, we can proceed with further debugging. |
… use is_decomm_pending() more
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…ine with CMIS spec" This reverts commit 02cf806.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
In the old version after "
@AnoopKamath In the previous version of code also api.decommission_all_datapaths() fails with config failure "ConfigRejectedPartialDataPath" , however appSel value get programmed to 0 and datapath state set to DataPathDeactivated . As it does force_cmis_reinit with desired application in the next retry, and as part of CMIS_STATE_AP_CONF state handling and it gets the config success and it works that way. In the upstream code, if I remove the below logic also it works without any issue. With this PR code, it gets the same config failure "ConfigRejectedPartialDataPath" while defaulting to AppSel=0. but it retry and attempt to program the same AppSel=0 and it fails again with config failure ConfigRejectedPartialDataPath and lead-port fails. Note: this transceiver is having default app is 100G-R4. Have we tried the logic where default application is not 800G-R8. Also few other question:
|
|
@vivekrnv @liat-grozovik will you be able to try this fix ? The issue was opened by Vivek sonic-net/sonic-buildimage#21603 |
@Keshavg-marvell : Does the ApplSel 0 get programmed to both SCS (page 10h) and ACS (page 11h)? We have a check in both newer and older code that should ignore if the current AppSel is 0. Also, please check with the optics vendor why there’s no config success with ApplSel 0. I understand the error with the older version since there’s no expected delays, but the new version of code has all the expected delays. So, we should see a config success. Answer inline to your questions:
Check
So, this PR code will decommission when a speed change occurs. We’re working on code changes to optimize this approach. Avoid decommissioning in cases where the width matches and multiple ApplSel configurations are in place. @longhuan-cisco is working on this and there will be separate PR for sonic-net/sonic-buildimage#23006 |
Verified and working as expected |
| == CMIS_STATE_FAILED | ||
| ) | ||
|
|
||
| def is_decommission_required(self, api, app_new): |
There was a problem hiding this comment.
@AnoopKamath This API will work only in cases where the module is operating with same application across all the datapath. Is that correct understanding? If there is a mix of application then?
There was a problem hiding this comment.
@prgeor : Your understanding is correct. This PR addresses DP INIT issues during decommission by incorporating delays recommended by modules, and prevents configuration failures during decommissioning. @longhuan-cisco is working on the code changes to handle mixed applications and speed changes (sonic-net/sonic-buildimage#23006) scenarios. Will send out the review soon.
|
Thanks Anoop for the detailed explanation. For your questions, SCS (page 10h) stage are it gets written but not to ACS (page 11h). we are following up with optics vendor why we are not getting the config success while decommission all datapath lanes. |
|
Hi @prgeor , please suggest whether this commit should be cherry-pick into 202505 branch, thanks |
|
Offline synced with Prince, this PR is not suggested to be included in 202505 branch, removing the label |
@dgsudharsan @yejianquan risky to take this backport. |

Description
Fix for : sonic-net/sonic-buildimage#21603
Add a logic to decommission data path by setting Appl Sel 0 using the existing CMIS SM.
When there are multiple subports/logical ports under the same physical port:
The non-lead port will remain in the CMIS inserted state until the lead port completes the decommissioning or it transitions to Failed state.
Motivation and Context
How Has This Been Tested?
Decommission lanes of the port which doesn't has required AppCode set to default
Additional Information (Optional)