Delay DOM polling until all ports are initialized #614

Open
prgeor wants to merge 2 commits into sonic-net:master from prgeor:delay-dom

Conversation

@prgeor commented May 13, 2025
Collaborator

Description

Delay DOM polling until all ports are initialized

Motivation and Context

Platforms with a large radix, such as 512 ports, may take longer to initialize and complete the CMIS datapath state machine. Because DOM polling is expensive and I/O bound, it can contend with the CMIS manager task while that task is still initializing the ports.

How Has This Been Tested?

Tested on a platform with 512 100G ports with 800G DR8 optics and saw overall link-up time drop by around 4 minutes.

Additional Information (Optional)

@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor requested a review from Junchao-Mellanox May 13, 2025 05:25
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Junchao-Mellanox

Collaborator

@moshemos @dgsudharsan for awareness

@mihirpat1 requested a review from Copilot May 13, 2025 18:07

Copilot AI left a comment

Contributor

Pull Request Overview

This PR delays DOM polling until all ports finish initialization, ensuring that the CMIS datapath state machine is complete before expensive polling begins.

  • Introduces a new function wait_port_initialization that loops until all logical ports are either initialized or removed from consideration.
  • Calls the new waiting function in the task_worker() method to delay DOM monitoring until all ports reach a terminal CMIS state.
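
The "initialized or removed from consideration" rule in the overview above can be sketched as a single poll pass. This is only an illustration, not the code from the PR: `prune_pending`, the `get_state` and `is_present` callbacks, and the state names in `TERMINAL_STATES` are all assumptions made for the example.

```python
# Hypothetical terminal states; the real names are defined in xcvrd and may differ.
TERMINAL_STATES = {"READY", "FAILED", "REMOVED"}


def prune_pending(pending, get_state, is_present):
    """Return the ports that still need to be waited on after one poll pass."""
    still_pending = set()
    for port in pending:
        if not is_present(port):
            # No transceiver plugged in: nothing to wait for.
            continue
        if get_state(port) in TERMINAL_STATES:
            # The state machine has finished for this port.
            continue
        still_pending.add(port)
    return still_pending


# Made-up sample data: one ready port, one mid-initialization, one empty cage.
states = {"Ethernet0": "READY", "Ethernet8": "DP_INIT", "Ethernet16": "UNKNOWN"}
present = {"Ethernet0": True, "Ethernet8": True, "Ethernet16": False}

remaining = prune_pending(set(states), states.get, present.get)
print(sorted(remaining))  # ['Ethernet8']
```

A caller would repeat this pass, sleeping between passes, until the returned set is empty.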

Comment thread sonic-xcvrd/xcvrd/dom/dom_mgr.py Outdated
Comment thread sonic-xcvrd/xcvrd/dom/dom_mgr.py Outdated
dom_info_update_periodic_secs = self.DOM_INFO_UPDATE_PERIOD_SECS

# Wait for all PORTs to be initialized
self.wait_port_initialization(dom_info_update_periodic_secs)

Copilot AI May 13, 2025

[nitpick] Consider implementing a timeout mechanism in wait_port_initialization() to prevent the possibility of an infinite wait if a port never reaches a terminal state.

Suggested change
- self.wait_port_initialization(dom_info_update_periodic_secs)
+ self.wait_port_initialization(dom_info_update_periodic_secs, timeout=300)  # Timeout set to 5 minutes

@mihirpat1 left a comment

Contributor

@prgeor Can you please help in fixing the build failure?

Comment thread sonic-xcvrd/xcvrd/dom/dom_mgr.py Outdated
continue

physical_port = physical_port_list[0]
if not xcvrd._wrapper_get_presence(physical_port):

Contributor

@prgeor Can we instead call xcvrd_utils.get_transceiver_presence()?
Also, can we check if self.task_stopping_event.is_set()?

Wondering if you want to use _validate_and_get_physical_port instead?
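
One way to honor the stop event inside the wait loop is to let the event double as the sleep, so a shutdown request interrupts the wait immediately. The sketch below uses only the standard `threading.Event` API; `wait_until_ready` and the `is_ready` callback are hypothetical names, and the real task may structure this differently.

```python
import threading


def wait_until_ready(pending, is_ready, stop_event, delay):
    """Poll until `pending` is empty or `stop_event` is set.

    Returns the set of ports that were still pending on exit.
    """
    pending = set(pending)
    while pending and not stop_event.is_set():
        pending = {port for port in pending if not is_ready(port)}
        if pending:
            # Event.wait() acts as an interruptible sleep: it returns early
            # as soon as another thread sets the event.
            stop_event.wait(delay)
    return pending


# A stop request that arrives before the wait starts leaves the port pending.
stop = threading.Event()
stop.set()
leftover = wait_until_ready({"Ethernet0"}, lambda port: False, stop, delay=0.01)

# Without a stop request, the loop ends once every port reports ready.
done = wait_until_ready({"Ethernet0"}, lambda port: True, threading.Event(), delay=0.01)
```

Compared with `time.sleep(delay)`, this keeps shutdown latency independent of the polling interval.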

Comment thread sonic-xcvrd/xcvrd/dom/dom_mgr.py Outdated

# Adding dom_info_update_periodic_secs to allow xcvrd to initialize ports
# before starting the periodic update
next_periodic_db_update_time = datetime.datetime.now() + datetime.timedelta(seconds=dom_info_update_periodic_secs)

Contributor

@prgeor Wondering if we should set is_periodic_db_update_needed to True since all the ports would have been in CMIS terminal state by this point?
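
The trade-off raised in this comment can be illustrated with a small decision helper. This is a simplified model rather than the logic in `dom_mgr.py`: `periodic_update_due` is a hypothetical function, and the 60-second period is an illustrative value only.

```python
import datetime


def periodic_update_due(now, next_update_time, update_needed):
    """Decide whether a periodic DB update should run on this iteration."""
    return update_needed or now >= next_update_time


period = datetime.timedelta(seconds=60)  # illustrative value
wait_done = datetime.datetime(2025, 5, 13, 12, 0, 0)
next_update = wait_done + period

# Option A: defer the first update by one full period after the wait.
deferred = periodic_update_due(wait_done, next_update, update_needed=False)

# Option B: force the first update as soon as the wait completes.
immediate = periodic_update_due(wait_done, next_update, update_needed=True)

print(deferred, immediate)  # False True
```

With option A the first DOM update lands one period after all ports reach a terminal state; with option B it runs on the first loop iteration.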

Comment thread sonic-xcvrd/xcvrd/dom/dom_mgr.py Outdated
def wait_port_initialization(self, delay):
    logical_port_set = set(self.port_mapping.logical_port_list)

    while logical_port_set:

Contributor

@prgeor Can we add an upper-bound timeout here to ensure that we don't end up in an infinite loop (ideally, this should not occur)?
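
A bounded wait along these lines might look like the sketch below. It is an assumption-laden illustration: `wait_with_deadline`, `is_ready`, and the parameter names are invented for the example, and it uses `time.monotonic()` for the deadline so that wall-clock adjustments during boot cannot stretch or shorten the wait.

```python
import time


def wait_with_deadline(pending, is_ready, poll_secs, timeout_secs):
    """Poll until all ports are ready or the timeout expires.

    Returns the ports still pending; an empty set means every port finished.
    """
    pending = set(pending)
    deadline = time.monotonic() + timeout_secs
    while pending:
        pending = {port for port in pending if not is_ready(port)}
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(poll_secs)
    return pending


# "Ethernet8" never becomes ready here, so the call returns after the timeout.
stuck = wait_with_deadline(
    {"Ethernet0", "Ethernet8"},
    lambda port: port == "Ethernet0",
    poll_secs=0.01,
    timeout_secs=0.05,
)
```

Returning the leftover set lets the caller log which ports never reached a terminal state before starting DOM monitoring anyway.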

@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mihirpat1 requested a review from Copilot May 29, 2025 20:34

Copilot AI left a comment

Contributor

Pull Request Overview

This PR delays DOM polling until all ports are initialized, aiming to reduce contention during port initialization.

  • Introduces a waiting function to poll for port initialization using defined timeout and polling constants.
  • Updates task_worker to log an error if ports remain uninitialized past the timeout, or to start DOM monitoring once all ports are ready.

Comment thread sonic-xcvrd/xcvrd/dom/dom_mgr.py Outdated
if datetime.datetime.now() > dom_wait_time_end:
    break

return logical_port_set

Copilot AI May 29, 2025

The return statement is placed inside the while loop, causing an early exit after the first iteration. Consider moving the return statement outside the loop to allow full polling until either a timeout occurs or all ports are initialized.

Suggested change
-         return logical_port_set
+     return logical_port_set
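
The effect of the `return` placement can be reproduced with a toy pair of functions. This is not the PR's code: `poll_buggy`, `poll_fixed`, and `make_ready_after` are hypothetical, and a poll counter stands in for the datetime-based timeout so the example stays deterministic.

```python
def make_ready_after(n):
    """Build an is_ready callback that reports False for its first n calls."""
    calls = {"count": 0}

    def is_ready(port):
        calls["count"] += 1
        return calls["count"] > n

    return is_ready


def poll_buggy(pending, is_ready, max_polls):
    pending = set(pending)
    polls = 0
    while pending:
        pending = {port for port in pending if not is_ready(port)}
        polls += 1
        if polls >= max_polls:
            break
        return pending  # inside the loop: exits after the first pass


def poll_fixed(pending, is_ready, max_polls):
    pending = set(pending)
    polls = 0
    while pending:
        pending = {port for port in pending if not is_ready(port)}
        polls += 1
        if polls >= max_polls:
            break
    return pending  # outside the loop: runs until done or out of polls


# The port becomes ready on the second poll.
buggy_result = poll_buggy({"Ethernet0"}, make_ready_after(1), max_polls=5)
fixed_result = poll_fixed({"Ethernet0"}, make_ready_after(1), max_polls=5)
print(buggy_result, fixed_result)  # {'Ethernet0'} set()
```

In this toy version the misplaced `return` also makes the function fall through and return `None` whenever the loop body never runs or ends with `break`.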

Contributor

@prgeor Can you please address this?

@mihirpat1 left a comment

Contributor

@prgeor Can you please fix the build failure?

self.log_notice("Stop event generated during DOM monitoring loop")
break

if not is_periodic_db_update_needed:

Contributor

@prgeor Why is this removed?


# Start loop to update dom info in DB periodically and handle port change events
while not self.task_stopping_event.is_set():
# Check if periodic db update is needed

Contributor

@prgeor Why is this removed?

else:
self.log_notice("All ports are in CMIS terminal state, start DOM monitoring")

# Start loop to update dom info in DB periodically and handle port change events

Contributor

@prgeor Please add is_periodic_db_update_needed = True to ensure that DOM data is updated periodically

@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Prince George <prgeor@microsoft.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

5 participants