Delay DOM polling until all ports are initialized#614
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@moshemos @dgsudharsan for awareness |
There was a problem hiding this comment.
Pull Request Overview
This PR delays DOM polling until all ports finish initialization, ensuring that the CMIS datapath state machine is complete before expensive polling begins.
- Introduces a new function wait_port_initialization that loops until all logical ports are either initialized or removed from consideration.
- Calls the new waiting function in the task_worker() method to delay DOM monitoring until all ports reach a terminal CMIS state.
| dom_info_update_periodic_secs = self.DOM_INFO_UPDATE_PERIOD_SECS | ||
|
|
||
| # Wait for all PORTs to be initialized | ||
| self.wait_port_initialization(dom_info_update_periodic_secs) |
There was a problem hiding this comment.
[nitpick] Consider implementing a timeout mechanism in wait_port_initialization() to prevent the possibility of an infinite wait if a port never reaches a terminal state.
| self.wait_port_initialization(dom_info_update_periodic_secs) | |
| self.wait_port_initialization(dom_info_update_periodic_secs, timeout=300) # Timeout set to 5 minutes |
| continue | ||
|
|
||
| physical_port = physical_port_list[0] | ||
| if not xcvrd._wrapper_get_presence(physical_port): |
There was a problem hiding this comment.
@prgeor Can we instead call xcvrd_utils.get_transceiver_presence()?
Also, can we check if self.task_stopping_event.is_set()?
Wondering if you want to use _validate_and_get_physical_port instead?
|
|
||
| # Adding dom_info_update_periodic_secs to allow xcvrd to initialize ports | ||
| # before starting the periodic update | ||
| next_periodic_db_update_time = datetime.datetime.now() + datetime.timedelta(seconds=dom_info_update_periodic_secs) |
There was a problem hiding this comment.
@prgeor Wondering if we should set is_periodic_db_update_needed to True since all the ports would have been in CMIS terminal state by this point?
| def wait_port_initialization(self, delay): | ||
| logical_port_set = set(self.port_mapping.logical_port_list) | ||
|
|
||
| while logical_port_set: |
There was a problem hiding this comment.
@prgeor Can we add an upper bound timeout here to ensure that we don't end up in infinite loop (ideally, this should not occur)?
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull Request Overview
This PR delays DOM polling until all ports are initialized, aiming to reduce contention during port initialization.
- Introduces a waiting function to poll for port initialization using defined timeout and polling constants.
- Updates task_worker to log an error if ports remain uninitialized past the timeout, or to start DOM monitoring once all ports are ready.
| if datetime.datetime.now() > dom_wait_time_end: | ||
| break | ||
|
|
||
| return logical_port_set |
There was a problem hiding this comment.
The return statement is placed inside the while loop, causing an early exit after the first iteration. Consider moving the return statement outside the loop to allow full polling until either a timeout occurs or all ports are initialized.
| return logical_port_set | |
| return logical_port_set |
| self.log_notice("Stop event generated during DOM monitoring loop") | ||
| break | ||
|
|
||
| if not is_periodic_db_update_needed: |
|
|
||
| # Start loop to update dom info in DB periodically and handle port change events | ||
| while not self.task_stopping_event.is_set(): | ||
| # Check if periodic db update is needed |
| else: | ||
| self.log_notice("All ports are in CMIS terminal state, start DOM monitoring") | ||
|
|
||
| # Start loop to update dom info in DB periodically and handle port change events |
There was a problem hiding this comment.
@prgeor Please add is_periodic_db_update_needed = True to ensure that DOM data is updated periodically
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Prince George <prgeor@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description
Delay DOM polling until all ports are initialized
Motivation and Context
Platforms with large radix like 512 ports may take more time to initialize and complete the CMIS datapath state machine. Since DOM polling is quite expensive on IO bound operation, it can result in contention with CMIS manager task which is initializing the port
How Has This Been Tested?
Tested this on a platform with 512 100G ports with 800G DR8 optics and see a reduction of overall link up time by around 4mins.
Additional Information (Optional)