Handle CMIS_INSERTED state to wait for module state machine transitions to a terminal state#813
Conversation
…ns to a terminal state Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
536c615 to
3e30f34
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
3e30f34 to
677ee90
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…settle Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
9dfb794 to
5259100
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR updates sonic-xcvrd’s CMIS state machine handling so that when entering CMIS_STATE_INSERTED, xcvrd can pause if the module datapath is in a transient transition (e.g., DataPathInit/DataPathDeinit) and only proceed once it reaches a non-transient (terminal) datapath state, avoiding lost operation context and incorrect retry timing after restarts/config reloads.
Changes:
- Track a per-port
dp_settle_deadlineand add logic to wait inCMIS_STATE_INSERTEDwhile datapath is transient, forcing a CMIS reinit on timeout. - Reset
dp_settle_deadlineonforce_cmis_reinit(). - Add unit tests for transient datapath detection and settling behavior; update existing worker tests’ mocked datapath state sequences.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py |
Adds transient datapath detection + INSERTED-state “settle wait” using a per-port deadline, and resets this state on reinit. |
sonic-xcvrd/tests/test_xcvrd.py |
Adds/extends tests to cover the new settle-wait logic and adjusts datapath state mock sequences for worker tests. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
d0537ec to
c8fcca9
Compare
|
/azp run |
1 similar comment
|
/azp run |
|
Commenter does not have sufficient privileges for PR 813 in repo sonic-net/sonic-platform-daemons |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…odule Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
mihirpat1
left a comment
There was a problem hiding this comment.
A few additional observations beyond the existing comments:
cmis_manager_task.py
-
Timeout path consumes a CMIS retry. On dp-settle timeout you call
force_cmis_reinit(lport, retries + 1). That counts the settle wait as a real CMIS retry againstCMIS_MAX_RETRIES = 3. Since the motivation of this PR is to avoid burning retries with incorrect timing, consider either keeping the retry count unchanged (force_cmis_reinit(lport, retries)) or documenting why the increment is intended. Otherwise repeated transient catches at boot can exhaust retries quickly. -
except Exceptionis too broad inget_transient_datapath_state. Other CMIS helpers in this file generally let exceptions propagate or catch the specific(AttributeError, NotImplementedError)pair. Swallowing all exceptions here silently returnsNone, which is indistinguishable from "no transient state" — the caller will then proceed to reconfigure even though the read failed. Narrow the except and/or surface the failure (e.g., treat read failure as transient + short retry, or return a distinct sentinel). -
Full-duration deadline is conservative.
get_cmis_dp_deinit_duration_secs/get_cmis_dp_init_duration_secsreturn the full transition time, but the module may already be partway through when we observe it. That's safe (worst case we wait the whole window) but worth a one-line comment, and consider logging the observed initial state so field debug is easier. -
No log when the wait resolves naturally. Today there's a "waiting up to Xs" entry and a "timeout, forcing reinit" entry, but nothing for the success path. A
log_notice("{}: datapath settled (DP state={})")when clearingdp_settle_deadlinewould make traces much easier to read. -
get_transient_datapath_statereadsapi.get_datapath_state()on every call. With the per-port loop this is cheap, but please double-check it does not race with the same call that follows in subsequent INSERTED-state processing (could be the same EEPROM access twice per cycle on every port). -
Lifecycle of
dp_settle_deadlineis one-shot per INSERTED entry — please add a docstring note stating that on xcvrd restart the deadline is intentionally recomputed from scratch (since the PR description targets exactly that restart scenario, that contract is worth pinning down).
tests/test_xcvrd.py
-
test_CmisManagerTask_should_wait_for_dp_settleis monolithic. It chains 5 sub-scenarios with shared mutable state (task.port_dict['Ethernet0'],task_stopping_event). A failure in one sub-test leaves the rest in an unpredictable state and pytest can't tell which scenario failed. Splitting into 5 separatedef test_*methods (or usingpytest.mark.parametrize) would make regressions much easier to diagnose. -
Hard-coded
time.timepatches assume monotonic ordering. The deadline assertiondp_settle_deadline == 1600.0relies on a singletime.time()call insideshould_wait_for_dp_settle. If anyone adds anothertime.time()reference, the test silently breaks. Consider patching once with a stateful side_effect that advances on each call, like the scheduling test does in #758. -
The new mock_xcvr_api.get_datapath_state side_effect lists added to three existing worker tests just prepend one more "all-Deactivated" entry. Worth a brief comment in each test (
# extra entry consumed by new dp-settle check in INSERTED state) so the next maintainer doesn't think it's a typo.
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@mihirpat1, for the final set of comments.
Keeping retries + 1 here is intentional. Otherwise module with genuine stuck issues will keep retrying forever
Good point — narrowed to (AttributeError, NotImplementedError) to match the convention used by the other CMIS helpers in this file. Unexpected I/O errors will now propagate instead of being silently treated as "no transient state".
Added a one-line comment explaining that we deliberately budget for the full transition window — worst case we wait the entire dp_deinit/dp_init duration, never less. The existing log_notice already includes the observed DP state, so field debug has that signal.
Added a log_notice("…: datapath settled, clearing dp_settle_deadline") on the success branch, gated so it only fires when a deadline was actually in progress (avoids spamming on the common "no transient" path).
Double-checked: should_wait_for_dp_settle is the only caller, and no other code path inside handle_cmis_inserted_state (or any downstream INSERTED-state processing in this cycle) reads get_datapath_state. So it's at most one EEPROM read per port per cycle while INSERTED — no duplicate access in the same cycle.
Expanded the should_wait_for_dp_settle docstring to explicitly state that the deadline is recomputed from scratch on each INSERTED entry (including after xcvrd restart), and that on restart the recompute is the whole point of this wait — xcvrd has no memory of how far the in-flight transition had progressed.
Split into 5 standalone tests, one per sub-scenario:
They share a tiny _make_dp_settle_task helper for the common task/api/mask setup so each test stays focused on its single assertion.
After the split in (7) each test now exercises only a single time.time() call path, so the original concern (a future time.time() reference silently breaking the test) no longer applies — there is only one call per test to patch. Leaving the simple return_value patches in place rather than introducing a stateful side_effect that nothing currently needs.
Added a comment above the prepended entry in each of the three worker tests so the next maintainer can see it's the read consumed by the new dp-settle check in INSERTED state, not a typo. |
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@prgeor @mihirpat1 All comments are addressed. Please review and approve. |
Description
Module should start from a non transient state before processing in the CMIS_INSERTED state.
Motivation and Context
If we reach INSERTED state with module in a transient state, we end up in scenario where we retry operations with incorrect timeout.
Scenario:
Datapath is being deactivated (timeout 5 seconds)
We get host_tx_ready flag as false
Move to INSERTED state, but lost the operation.
Force cmis_reinit with retry count increment (waiting no time since we lost the context of the 5 seconds)
Exhaust all retries and link stay down
How Has This Been Tested?
Config reload results in DPDeinit, during the time when we get a host_tx_ready false state. This moves the state machine back to inserted but loses the context of wait time.
Tested against config reloads, reboots and shut/start on NH-5010
Additional Information (Optional)
Which release branch to port