Handle CMIS_INSERTED state to wait for module state machine transitions to a terminal state #813

Open
arpit-nexthop wants to merge 9 commits into sonic-net:master from nexthop-ai:handle-cmis-inserted-state-to-wait

Conversation

@arpit-nexthop

Contributor

Description

The module should be in a non-transient state before xcvrd starts processing it in the CMIS_INSERTED state.

Motivation and Context

If we reach the INSERTED state while the module is still in a transient state, we end up retrying operations with an incorrect timeout.

Scenario:

  1. The datapath is being deactivated (timeout: 5 seconds).
  2. The host_tx_ready flag is read as false.
  3. The state machine moves back to the INSERTED state and loses track of the in-flight operation.
  4. cmis_reinit is forced with the retry count incremented, without waiting, because the context of the 5-second timeout was lost.
  5. All retries are exhausted and the link stays down.
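
To make the scenario concrete, here is a toy simulation (not xcvrd code). It assumes a hypothetical 1-second state-machine loop period (`TICK_SECS`) and a hypothetical `simulate` helper; only the 5-second deactivation timeout and the retry limit of 3 come from this PR's discussion.

```python
CMIS_MAX_RETRIES = 3   # retry limit quoted in the review discussion
DP_DEINIT_SECS = 5.0   # datapath deactivation timeout from the scenario
TICK_SECS = 1.0        # hypothetical state-machine loop period


def simulate(wait_for_settle):
    """Return (link_up, retries_used) for a module caught mid-DPDeinit."""
    now, retries = 0.0, 0
    while retries < CMIS_MAX_RETRIES:
        module_is_transient = now < DP_DEINIT_SECS
        if not module_is_transient:
            # Module reached a terminal state; normal processing can proceed.
            return True, retries
        if not wait_for_settle:
            # Reinit is forced immediately because the timeout context was lost.
            retries += 1
        now += TICK_SECS
    return False, retries


print(simulate(wait_for_settle=False))
print(simulate(wait_for_settle=True))
```

Under these assumptions, the run without a settle wait burns every retry before the 5 seconds elapse, while the run that waits reaches a settled module without consuming any.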

How Has This Been Tested?

A config reload triggers DPDeinit, and while it is in progress host_tx_ready is reported as false. This moves the state machine back to INSERTED but loses the context of the wait time.

Tested against config reloads, reboots, and shut/start on NH-5010.

Additional Information (Optional)

Which release branch to port

  • master
  • 202511
  • 202505

…ns to a terminal state

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@linux-foundation-easycla

linux-foundation-easycla Bot commented May 8, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@arpit-nexthop arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from 536c615 to 3e30f34 Compare May 8, 2026 22:52
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
@arpit-nexthop arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from 3e30f34 to 677ee90 Compare May 8, 2026 22:56

…settle

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
@arpit-nexthop arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from 9dfb794 to 5259100 Compare May 8, 2026 23:32

Copilot AI left a comment

Contributor

Pull request overview

This PR updates sonic-xcvrd’s CMIS state machine handling so that when entering CMIS_STATE_INSERTED, xcvrd can pause if the module datapath is in a transient transition (e.g., DataPathInit/DataPathDeinit) and only proceed once it reaches a non-transient (terminal) datapath state, avoiding lost operation context and incorrect retry timing after restarts/config reloads.

Changes:

  • Track a per-port dp_settle_deadline and add logic to wait in CMIS_STATE_INSERTED while datapath is transient, forcing a CMIS reinit on timeout.
  • Reset dp_settle_deadline on force_cmis_reinit().
  • Add unit tests for transient datapath detection and settling behavior; update existing worker tests’ mocked datapath state sequences.
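
The settle-wait decision summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `settle_decision`, its string return values, and the plain-dict port record are hypothetical, and only `dp_settle_deadline` and the two transient state names are taken from this conversation.

```python
import time

# Transient datapath states named in the overview; the real list may be longer.
TRANSIENT_DP_STATES = {"DataPathInit", "DataPathDeinit"}


def settle_decision(port, dp_state, duration_secs, now=None):
    """Return 'proceed', 'wait', or 'reinit' for a port in the INSERTED state.

    `port` is a hypothetical per-port dict holding `dp_settle_deadline`.
    `duration_secs` is the full init/deinit duration advertised by the module.
    """
    now = time.time() if now is None else now
    if dp_state not in TRANSIENT_DP_STATES:
        # Terminal state: drop any pending deadline and continue as usual.
        port.pop("dp_settle_deadline", None)
        return "proceed"
    deadline = port.get("dp_settle_deadline")
    if deadline is None:
        # First time the transient state is observed: budget the full window.
        port["dp_settle_deadline"] = now + duration_secs
        return "wait"
    if now < deadline:
        return "wait"
    # Deadline expired while still transient: caller should force a reinit.
    port.pop("dp_settle_deadline", None)
    return "reinit"


port = {}
print(settle_decision(port, "DataPathDeinit", 5.0, now=100.0))
print(settle_decision(port, "DataPathDeinit", 5.0, now=103.0))
print(settle_decision(port, "DataPathDeinit", 5.0, now=106.0))
```

Passing `now` explicitly keeps the sketch deterministic; in the daemon the value would presumably come from the wall clock.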

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py | Adds transient datapath detection + INSERTED-state “settle wait” using a per-port deadline, and resets this state on reinit. |
| sonic-xcvrd/tests/test_xcvrd.py | Adds/extends tests to cover the new settle-wait logic and adjusts datapath state mock sequences for worker tests. |

Comment thread sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py Outdated

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>
@arpit-nexthop arpit-nexthop force-pushed the handle-cmis-inserted-state-to-wait branch from d0537ec to c8fcca9 Compare May 11, 2026 16:39
@arpit-nexthop

Contributor Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 813 in repo sonic-net/sonic-platform-daemons

Comment thread sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py Outdated
Comment thread sonic-xcvrd/xcvrd/cmis/cmis_manager_task.py

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

@mihirpat1 mihirpat1 left a comment

Contributor

A few additional observations beyond the existing comments:

cmis_manager_task.py

  1. Timeout path consumes a CMIS retry. On dp-settle timeout you call force_cmis_reinit(lport, retries + 1). That counts the settle wait as a real CMIS retry against CMIS_MAX_RETRIES = 3. Since the motivation of this PR is to avoid burning retries with incorrect timing, consider either keeping the retry count unchanged (force_cmis_reinit(lport, retries)) or documenting why the increment is intended. Otherwise repeated transient catches at boot can exhaust retries quickly.

  2. except Exception is too broad in get_transient_datapath_state. Other CMIS helpers in this file generally let exceptions propagate or catch the specific (AttributeError, NotImplementedError) pair. Swallowing all exceptions here silently returns None, which is indistinguishable from "no transient state" — the caller will then proceed to reconfigure even though the read failed. Narrow the except and/or surface the failure (e.g., treat read failure as transient + short retry, or return a distinct sentinel).

  3. Full-duration deadline is conservative. get_cmis_dp_deinit_duration_secs / get_cmis_dp_init_duration_secs return the full transition time, but the module may already be partway through when we observe it. That's safe (worst case we wait the whole window) but worth a one-line comment, and consider logging the observed initial state so field debug is easier.

  4. No log when the wait resolves naturally. Today there's a "waiting up to Xs" entry and a "timeout, forcing reinit" entry, but nothing for the success path. A log_notice("{}: datapath settled (DP state={})") when clearing dp_settle_deadline would make traces much easier to read.

  5. get_transient_datapath_state reads api.get_datapath_state() on every call. With the per-port loop this is cheap, but please double-check it does not race with the same call that follows in subsequent INSERTED-state processing (could be the same EEPROM access twice per cycle on every port).

  6. Lifecycle of dp_settle_deadline is one-shot per INSERTED entry — please add a docstring note stating that on xcvrd restart the deadline is intentionally recomputed from scratch (since the PR description targets exactly that restart scenario, that contract is worth pinning down).
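
As an illustration of point 2, one way to keep a failed read distinguishable from "no transient state" is a dedicated sentinel. The sketch below is hypothetical: `read_transient_dp_state`, `READ_FAILED`, and `FakeApi` are made-up names, and `get_datapath_state()` is assumed to return a mapping of lane name to state string.

```python
TRANSIENT_DP_STATES = {"DataPathInit", "DataPathDeinit"}
READ_FAILED = object()  # hypothetical sentinel, distinct from None


def read_transient_dp_state(api):
    """Return a transient state name, None if settled, or READ_FAILED."""
    try:
        lane_states = api.get_datapath_state()
    except (AttributeError, NotImplementedError):
        # Only the narrow pair is caught; other errors propagate to the caller.
        return READ_FAILED
    for state in lane_states.values():
        if state in TRANSIENT_DP_STATES:
            return state
    return None


class FakeApi:
    """Stand-in for a transceiver API object, for demonstration only."""

    def __init__(self, states=None, error=None):
        self._states = states or {}
        self._error = error

    def get_datapath_state(self):
        if self._error is not None:
            raise self._error
        return self._states


print(read_transient_dp_state(FakeApi({"DP1State": "DataPathDeinit"})))
print(read_transient_dp_state(FakeApi({"DP1State": "DataPathActivated"})))
print(read_transient_dp_state(FakeApi(error=NotImplementedError())) is READ_FAILED)
```

With a sentinel, the caller can choose to treat a failed read as "still transient, retry shortly" instead of proceeding to reconfigure.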

tests/test_xcvrd.py

  1. test_CmisManagerTask_should_wait_for_dp_settle is monolithic. It chains 5 sub-scenarios with shared mutable state (task.port_dict['Ethernet0'], task_stopping_event). A failure in one sub-test leaves the rest in an unpredictable state and pytest can't tell which scenario failed. Splitting into 5 separate def test_* methods (or using pytest.mark.parametrize) would make regressions much easier to diagnose.

  2. Hard-coded time.time patches assume monotonic ordering. The deadline assertion dp_settle_deadline == 1600.0 relies on a single time.time() call inside should_wait_for_dp_settle. If anyone adds another time.time() reference, the test silently breaks. Consider patching once with a stateful side_effect that advances on each call, like the scheduling test does in #758.

  3. The new mock_xcvr_api.get_datapath_state side_effect lists added to three existing worker tests just prepend one more "all-Deactivated" entry. Worth a brief comment in each test (# extra entry consumed by new dp-settle check in INSERTED state) so the next maintainer doesn't think it's a typo.
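
For point 2 of the test comments, a stateful `side_effect` could look like the sketch below. It uses only `unittest.mock.patch`; `advancing_clock` and `deadline_from_now` are hypothetical helpers and do not reproduce the scheduling test referenced in #758.

```python
import itertools
import time
from unittest import mock


def advancing_clock(start, step):
    """Return a callable that yields start, start + step, ... on each call."""
    counter = itertools.count()
    return lambda: start + step * next(counter)


def deadline_from_now(duration_secs):
    """Hypothetical stand-in for code under test that reads the clock once."""
    return time.time() + duration_secs


with mock.patch("time.time", side_effect=advancing_clock(1000.0, 1.0)):
    first_deadline = deadline_from_now(5.0)   # consumes clock value 1000.0
    later_reading = time.time()               # consumes clock value 1001.0

print(first_deadline, later_reading)
```

Because every call advances the fake clock, an extra `time.time()` reference added later shifts the observed values and shows up as an assertion failure.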

mihirpat1 previously approved these changes Jun 2, 2026
Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

@arpit-nexthop

Contributor Author

@mihirpat1, responses to the final set of comments:

  1. Timeout path consumes a CMIS retry

Keeping retries + 1 here is intentional. Otherwise a module that is genuinely stuck would keep retrying forever.

  2. except Exception is too broad in get_transient_datapath_state

Good point — narrowed to (AttributeError, NotImplementedError) to match the convention used by the other CMIS helpers in this file. Unexpected I/O errors will now propagate instead of being silently treated as "no transient state".

  3. Full-duration deadline is conservative

Added a one-line comment explaining that we deliberately budget for the full transition window — worst case we wait the entire dp_deinit/dp_init duration, never less. The existing log_notice already includes the observed DP state, so field debug has that signal.

  4. No log when the wait resolves naturally

Added a log_notice("…: datapath settled, clearing dp_settle_deadline") on the success branch, gated so it only fires when a deadline was actually in progress (avoids spamming on the common "no transient" path).

  5. get_transient_datapath_state reads api.get_datapath_state() on every call

Double-checked: should_wait_for_dp_settle is the only caller, and no other code path inside handle_cmis_inserted_state (or any downstream INSERTED-state processing in this cycle) reads get_datapath_state. So it's at most one EEPROM read per port per cycle while INSERTED — no duplicate access in the same cycle.

  6. Lifecycle of dp_settle_deadline is one-shot per INSERTED entry

Expanded the should_wait_for_dp_settle docstring to explicitly state that the deadline is recomputed from scratch on each INSERTED entry (including after xcvrd restart), and that on restart the recompute is the whole point of this wait — xcvrd has no memory of how far the in-flight transition had progressed.

  7. test_CmisManagerTask_should_wait_for_dp_settle is monolithic

Split into 5 standalone tests, one per sub-scenario:

  • …_not_transient_clears_deadline
  • …_first_deinit_sets_deadline
  • …_first_init_sets_deadline
  • …_within_deadline_still_waiting
  • …_past_deadline_forces_reinit

They share a tiny _make_dp_settle_task helper for the common task/api/mask setup so each test stays focused on its single assertion.

  8. Hard-coded time.time patches assume monotonic ordering

After the split in (7) each test now exercises only a single time.time() call path, so the original concern (a future time.time() reference silently breaking the test) no longer applies — there is only one call per test to patch. Leaving the simple return_value patches in place rather than introducing a stateful side_effect that nothing currently needs.

  9. The new mock_xcvr_api.get_datapath_state side_effect lists

Added a comment above the prepended entry in each of the three worker tests so the next maintainer can see it's the read consumed by the new dp-settle check in INSERTED state, not a typo.

Signed-off-by: arpit-nexthop <arpit@nexthop.ai>

@arpit-nexthop

Contributor Author

@prgeor @mihirpat1 All comments are addressed. Please review and approve.
