Handle CMIS_INSERTED state to wait for module state machine transitions to a terminal state by arpit-nexthop · Pull Request #813 · sonic-net/sonic-platform-daemons

arpit-nexthop · 2026-05-08T18:24:20Z

Description

Module should start from a non transient state before processing in the CMIS_INSERTED state.

Motivation and Context

If we reach INSERTED state with module in a transient state, we end up in scenario where we retry operations with incorrect timeout.

Scenario:

Datapath is being deactivated (timeout 5 seconds)
We get host_tx_ready flag as false
Move to INSERTED state, but lost the operation.
Force cmis_reinit with retry count increment (waiting no time since we lost the context of the 5 seconds)
Exhaust all retries and link stay down

How Has This Been Tested?

Config reload results in DPDeinit, during the time when we get a host_tx_ready false state. This moves the state machine back to inserted but loses the context of wait time.

Tested against config reloads, reboots and shut/start on NH-5010

Additional Information (Optional)

Which release branch to port

master
202511
202505

…ns to a terminal state Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-05-08T18:24:28Z

/azp run

azure-pipelines · 2026-05-08T18:24:38Z

Azure Pipelines successfully started running 1 pipeline(s).

linux-foundation-easycla · 2026-05-08T22:50:18Z

The committers listed above are authorized under a signed CLA.

✅ login: arpit-nexthop / name: arpit-nexthop (5259100, 677ee90, 69c0d8c, 8f01efa, c8fcca9)

mssonicbld · 2026-05-08T22:50:18Z

/azp run

azure-pipelines · 2026-05-08T22:50:27Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-05-08T22:52:27Z

/azp run

azure-pipelines · 2026-05-08T22:52:36Z

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-05-08T22:56:55Z

/azp run

azure-pipelines · 2026-05-08T22:57:09Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-05-08T23:28:31Z

/azp run

azure-pipelines · 2026-05-08T23:28:40Z

Azure Pipelines successfully started running 1 pipeline(s).

…settle Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-05-08T23:32:59Z

/azp run

azure-pipelines · 2026-05-08T23:33:07Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

This PR updates sonic-xcvrd’s CMIS state machine handling so that when entering CMIS_STATE_INSERTED, xcvrd can pause if the module datapath is in a transient transition (e.g., DataPathInit/DataPathDeinit) and only proceed once it reaches a non-transient (terminal) datapath state, avoiding lost operation context and incorrect retry timing after restarts/config reloads.

Changes:

Track a per-port dp_settle_deadline and add logic to wait in CMIS_STATE_INSERTED while datapath is transient, forcing a CMIS reinit on timeout.
Reset dp_settle_deadline on force_cmis_reinit().
Add unit tests for transient datapath detection and settling behavior; update existing worker tests’ mocked datapath state sequences.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py`	Adds transient datapath detection + INSERTED-state “settle wait” using a per-port deadline, and resets this state on reinit.
`sonic-xcvrd/tests/test_xcvrd.py`	Adds/extends tests to cover the new settle-wait logic and adjusts datapath state mock sequences for worker tests.

mssonicbld · 2026-05-09T03:00:34Z

/azp run

azure-pipelines · 2026-05-09T03:00:51Z

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-05-11T18:26:59Z

/azp run

arpit-nexthop · 2026-05-11T18:27:01Z

/azp run

azure-pipelines · 2026-05-11T18:27:07Z

Commenter does not have sufficient privileges for PR 813 in repo sonic-net/sonic-platform-daemons

azure-pipelines · 2026-05-11T18:27:10Z

Azure Pipelines successfully started running 1 pipeline(s).

…odule Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-05-12T17:00:45Z

/azp run

azure-pipelines · 2026-05-12T17:00:55Z

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-05-12T17:52:36Z

/azp run

azure-pipelines · 2026-05-12T17:52:46Z

Azure Pipelines successfully started running 1 pipeline(s).

mihirpat1

A few additional observations beyond the existing comments:

cmis_manager_task.py

Timeout path consumes a CMIS retry. On dp-settle timeout you call force_cmis_reinit(lport, retries + 1). That counts the settle wait as a real CMIS retry against CMIS_MAX_RETRIES = 3. Since the motivation of this PR is to avoid burning retries with incorrect timing, consider either keeping the retry count unchanged (force_cmis_reinit(lport, retries)) or documenting why the increment is intended. Otherwise repeated transient catches at boot can exhaust retries quickly.
except Exception is too broad in get_transient_datapath_state. Other CMIS helpers in this file generally let exceptions propagate or catch the specific (AttributeError, NotImplementedError) pair. Swallowing all exceptions here silently returns None, which is indistinguishable from "no transient state" — the caller will then proceed to reconfigure even though the read failed. Narrow the except and/or surface the failure (e.g., treat read failure as transient + short retry, or return a distinct sentinel).
Full-duration deadline is conservative. get_cmis_dp_deinit_duration_secs / get_cmis_dp_init_duration_secs return the full transition time, but the module may already be partway through when we observe it. That's safe (worst case we wait the whole window) but worth a one-line comment, and consider logging the observed initial state so field debug is easier.
No log when the wait resolves naturally. Today there's a "waiting up to Xs" entry and a "timeout, forcing reinit" entry, but nothing for the success path. A log_notice("{}: datapath settled (DP state={})") when clearing dp_settle_deadline would make traces much easier to read.
get_transient_datapath_state reads api.get_datapath_state() on every call. With the per-port loop this is cheap, but please double-check it does not race with the same call that follows in subsequent INSERTED-state processing (could be the same EEPROM access twice per cycle on every port).
Lifecycle of dp_settle_deadline is one-shot per INSERTED entry — please add a docstring note stating that on xcvrd restart the deadline is intentionally recomputed from scratch (since the PR description targets exactly that restart scenario, that contract is worth pinning down).

tests/test_xcvrd.py

test_CmisManagerTask_should_wait_for_dp_settle is monolithic. It chains 5 sub-scenarios with shared mutable state (task.port_dict['Ethernet0'], task_stopping_event). A failure in one sub-test leaves the rest in an unpredictable state and pytest can't tell which scenario failed. Splitting into 5 separate def test_* methods (or using pytest.mark.parametrize) would make regressions much easier to diagnose.
Hard-coded time.time patches assume monotonic ordering. The deadline assertion dp_settle_deadline == 1600.0 relies on a single time.time() call inside should_wait_for_dp_settle. If anyone adds another time.time() reference, the test silently breaks. Consider patching once with a stateful side_effect that advances on each call, like the scheduling test does in #758.
The new mock_xcvr_api.get_datapath_state side_effect lists added to three existing worker tests just prepend one more "all-Deactivated" entry. Worth a brief comment in each test (# extra entry consumed by new dp-settle check in INSERTED state) so the next maintainer doesn't think it's a typo.

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-06-03T14:29:05Z

/azp run

azure-pipelines · 2026-06-03T14:29:17Z

Azure Pipelines successfully started running 1 pipeline(s).

arpit-nexthop · 2026-06-03T14:30:11Z

@mihirpat1, for the final set of comments.

Timeout path consumes a CMIS retry

Keeping retries + 1 here is intentional. Otherwise module with genuine stuck issues will keep retrying forever

except Exception is too broad in get_transient_datapath_state

Good point — narrowed to (AttributeError, NotImplementedError) to match the convention used by the other CMIS helpers in this file. Unexpected I/O errors will now propagate instead of being silently treated as "no transient state".

Full-duration deadline is conservative

Added a one-line comment explaining that we deliberately budget for the full transition window — worst case we wait the entire dp_deinit/dp_init duration, never less. The existing log_notice already includes the observed DP state, so field debug has that signal.

No log when the wait resolves naturally

Added a log_notice("…: datapath settled, clearing dp_settle_deadline") on the success branch, gated so it only fires when a deadline was actually in progress (avoids spamming on the common "no transient" path).

get_transient_datapath_state reads api.get_datapath_state() on every call

Double-checked: should_wait_for_dp_settle is the only caller, and no other code path inside handle_cmis_inserted_state (or any downstream INSERTED-state processing in this cycle) reads get_datapath_state. So it's at most one EEPROM read per port per cycle while INSERTED — no duplicate access in the same cycle.

Lifecycle of dp_settle_deadline is one-shot per INSERTED entry

Expanded the should_wait_for_dp_settle docstring to explicitly state that the deadline is recomputed from scratch on each INSERTED entry (including after xcvrd restart), and that on restart the recompute is the whole point of this wait — xcvrd has no memory of how far the in-flight transition had progressed.

test_CmisManagerTask_should_wait_for_dp_settle is monolithic

Split into 5 standalone tests, one per sub-scenario:

…_not_transient_clears_deadline
…_first_deinit_sets_deadline
…_first_init_sets_deadline
…_within_deadline_still_waiting
…_past_deadline_forces_reinit

They share a tiny _make_dp_settle_task helper for the common task/api/mask setup so each test stays focused on its single assertion.

Hard-coded time.time patches assume monotonic ordering

After the split in (7) each test now exercises only a single time.time() call path, so the original concern (a future time.time() reference silently breaking the test) no longer applies — there is only one call per test to patch. Leaving the simple return_value patches in place rather than introducing a stateful side_effect that nothing currently needs.

The new mock_xcvr_api.get_datapath_state side_effect lists

Added a comment above the prepended entry in each of the three worker tests so the next maintainer can see it's the read consumed by the new dp-settle check in INSERTED state, not a typo.

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mssonicbld · 2026-06-03T15:17:43Z

/azp run

azure-pipelines · 2026-06-03T15:17:53Z

Azure Pipelines successfully started running 1 pipeline(s).

arpit-nexthop · 2026-06-04T14:25:26Z

@prgeor @mihirpat1 All comments are addressed. Please review and approve.

Handle CMIS_INSERTED state to wait for module state machine transitio…

69c0d8c

…ns to a terminal state Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from 536c615 to 3e30f34 Compare May 8, 2026 22:52

Fix CMIS task_worker tests for new should_wait_for_dp_settle call

677ee90

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from 3e30f34 to 677ee90 Compare May 8, 2026 22:56

Add coverage for get_transient_datapath_state and should_wait_for_dp_…

5259100

…settle Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from 9dfb794 to 5259100 Compare May 8, 2026 23:32

mihirpat1 requested a review from Copilot May 9, 2026 00:51

Copilot started reviewing on behalf of mihirpat1 May 9, 2026 00:51 View session

Copilot AI reviewed May 9, 2026

View reviewed changes

Comment thread sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py Outdated

Avoid unnecessary call to the API

c8fcca9

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from d0537ec to c8fcca9 Compare May 11, 2026 16:39

mihirpat1 reviewed May 11, 2026

View reviewed changes

Comment thread sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py Outdated

Comment thread sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py

arpit-nexthop added 2 commits May 11, 2026 23:55

Changed the wait function to work on datapath instead of the entire m…

8f01efa

…odule Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

Merge branch 'sonic-net:master' into handle-cmis-inserted-state-to-wait

e93b408

Fix the error message from warning to error

c054413

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mihirpat1 reviewed Jun 2, 2026

View reviewed changes

mihirpat1 previously approved these changes Jun 2, 2026

View reviewed changes

Fix the review comments

0f16023

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

arpit-nexthop dismissed mihirpat1’s stale review via 0f16023 June 3, 2026 14:28

Update test to reflect exception

613b9ec

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

mihirpat1 approved these changes Jun 4, 2026

View reviewed changes

Conversation

arpit-nexthop commented May 8, 2026

Description

Motivation and Context

How Has This Been Tested?

Additional Information (Optional)

Which release branch to port

Uh oh!

mssonicbld commented May 8, 2026

Uh oh!

azure-pipelines Bot commented May 8, 2026

Uh oh!

linux-foundation-easycla Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mssonicbld commented May 8, 2026

Uh oh!

azure-pipelines Bot commented May 8, 2026

Uh oh!

mssonicbld commented May 8, 2026

Uh oh!

azure-pipelines Bot commented May 8, 2026

Uh oh!

mssonicbld commented May 8, 2026

Uh oh!

azure-pipelines Bot commented May 8, 2026

Uh oh!

mssonicbld commented May 8, 2026

Uh oh!

azure-pipelines Bot commented May 8, 2026

Uh oh!

mssonicbld commented May 8, 2026

Uh oh!

azure-pipelines Bot commented May 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

mssonicbld commented May 9, 2026

Uh oh!

azure-pipelines Bot commented May 9, 2026

Uh oh!

mssonicbld commented May 11, 2026

Uh oh!

arpit-nexthop commented May 11, 2026

Uh oh!

azure-pipelines Bot commented May 11, 2026

Uh oh!

azure-pipelines Bot commented May 11, 2026

Uh oh!

Uh oh!

Uh oh!

mssonicbld commented May 12, 2026

Uh oh!

azure-pipelines Bot commented May 12, 2026

Uh oh!

mssonicbld commented May 12, 2026

Uh oh!

azure-pipelines Bot commented May 12, 2026

Uh oh!

mihirpat1 left a comment

Choose a reason for hiding this comment

Uh oh!

mssonicbld commented Jun 3, 2026

Uh oh!

azure-pipelines Bot commented Jun 3, 2026

Uh oh!

arpit-nexthop commented Jun 3, 2026

Uh oh!

mssonicbld commented Jun 3, 2026

Uh oh!

azure-pipelines Bot commented Jun 3, 2026

Uh oh!

arpit-nexthop commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

linux-foundation-easycla Bot commented May 8, 2026 •

edited

Loading