Skip to content

[Smartswitch] Fix incorrect reporting of data plane and control plane by DPU#606

Merged
rlhui merged 8 commits into
sonic-net:masterfrom
gpunathilell:dpu_midplane_fix
Jun 13, 2025
Merged

[Smartswitch] Fix incorrect reporting of data plane and control plane by DPU#606
rlhui merged 8 commits into
sonic-net:masterfrom
gpunathilell:dpu_midplane_fix

Conversation

@gpunathilell

@gpunathilell gpunathilell commented Apr 10, 2025

Copy link
Copy Markdown
Contributor

Description

Motivation and Context

Change was done so that the DPU states are updated on the following conditions:
midplane_state -> The switch erases the DPU_STATE table when the midplane state changes to down, So as to remove the stale states of control plane and data plane
control_plane_state and data_plane_state Subscribe to the DPU_STATE table, if there is an update, where only midplane_state is updated to the table, (meaning there is no control plane or data plane state present in the table). Then update the states back to the table, (If DPUs are online, this means that the states have been removed by switch

How Has This Been Tested?

Set midplane interface to down on DPU:
ifconfig eth0-midplane down -> Observe that control plane and dataplane states are set to down
ifconfig eth0-midplane up -> Observe that control plane and dataplane states are added back

Additional Information (Optional)

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@gpunathilell

Copy link
Copy Markdown
Contributor Author

@vvolam @rameshraghupathy Please review

Comment thread sonic-chassisd/scripts/chassisd Outdated
def update_dpu_state(self, key, state):
"""
Update specific DPU state fields in chassisStateDB using the given key.
If state is 'down', delete the table first before setting new values.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell @vvolam
The original intent is to "Update specific DPU state fields in chassisStateDB using the given key." and not to delete the table. Doing so, will wipe out the "CP, DP" details that was there around the time of midplane failure. The intent was to preserved that data which has important information that will help debug easily.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell I think you are deleting only the entry in the table right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state is only changed to down at this point, We are not deleting any entries

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell Then can you fix the comment accordingly? It says you are deleting the table first.

Comment thread sonic-chassisd/scripts/chassisd Outdated
if not self.chassis_state_db:
self.chassis_state_db = daemon_base.db_connect("CHASSIS_STATE_DB")

# If state is down, delete the table first

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell @vvolam We shouldn't delete the entire table. If there is a need to delete the table (I need to understand the requirement fully), save the CP/DP values and restore them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy I guess Gagan is deleting the table entry here and not entire table as per my understanding. I think entry is deleted here and added at line 799?

@gpunathilell, But why do we have to delete entry and updating entry is not helping here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state is only changed to down at this point, We are not deleting any entries

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from Copilot May 20, 2025 20:24

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the incorrect reporting of data plane and control plane states for DPUs by updating the state update logic and enhancing test coverage. Key changes include new tests for various DPU state management scenarios, additions to the mock framework for DB connector retrieval, and modifications to the state update logic in the chassisd script to correctly handle CHASSIS_STATE_DB updates.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
sonic-chassisd/tests/test_dpu_chassisd.py Added tests validating deletion, state tracking, duplicate update avoidance, and handling of state changes for DPUs.
sonic-chassisd/tests/test_chassisd.py Added a test to verify midplane state change updates (module updater).
sonic-chassisd/tests/mock_swsscommon.py Introduced getDbConnector() in the mock DB connector for consistency with production code.
sonic-chassisd/scripts/chassisd Updated DPU state and module updater logic including the use of shared constants and refined state update conditions.

Comment on lines +1441 to +1445
update_required = False

self.dpu_state_updater.update_state()
for s in selectable:
result = s.pop()
update_required = True # If there is any selectable object, we need to update the state

Copilot AI May 20, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'update_required' flag is being reinitialized for each selectable within the loop instead of accumulating conditions from all objects. Consider initializing 'update_required' to false before the loop and then OR-ing or accumulating the condition for each selectable to ensure proper state update decision.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update_required is set this way, because it shows a clear distinction on when we perform the actual update, and when the update is disabled, I think implementing it this way is more clearer to understand

@vvolam vvolam left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than minor comment fix

Comment thread sonic-chassisd/scripts/chassisd Outdated
def update_dpu_state(self, key, state):
"""
Update specific DPU state fields in chassisStateDB using the given key.
If state is 'down', delete the table first before setting new values.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell Then can you fix the comment accordingly? It says you are deleting the table first.

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@gpunathilell

Copy link
Copy Markdown
Contributor Author

@vvolam Please merge this PR

@rlhui rlhui merged commit 0c2ced0 into sonic-net:master Jun 13, 2025
5 checks passed
@mssonicbld

Copy link
Copy Markdown
Collaborator

Cherry-pick PR to 202505: #626

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants