snappi: Assert DUT peer ports are operationally up before PFCwd basic test#25331

Open
ediwibowo-msft wants to merge 1 commit into sonic-net:master from ediwibowo-msft:fix/pfcwd_basic_helper


@ediwibowo-msft ediwibowo-msft commented Jun 12, 2026

Contributor

Summary:
Assert that the DUT-side ports for both the ingress and egress DUTs are operationally up before running the PFC watchdog basic test. Without this guard, run_pfcwd_basic_test proceeds even when one of the DUT ports is still down.

The new check uses wait_until(30, 2, 0, host.is_interface_status_up, port), i.e. a 30s timeout, a 2s polling interval, and no initial delay, so a transiently down link gets up to 30s to recover before the test fails with a clear, actionable message naming the offending host and port.
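
To make the polling semantics concrete, here is a minimal, self-contained sketch. The wait_until below is a simplified stand-in written for illustration, not the sonic-mgmt implementation, and FlappingHost is a hypothetical stub for a DUT host object; only the argument order (timeout, interval, delay, condition, args) is taken from the call above.

```python
import time


def wait_until(timeout, interval, delay, condition, *args, **kwargs):
    """Simplified stand-in poller (assumption, not the sonic-mgmt code).

    Sleeps `delay` seconds, then calls `condition` every `interval` seconds
    until it returns a truthy value or `timeout` seconds have elapsed.
    """
    if delay > 0:
        time.sleep(delay)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(*args, **kwargs):
            return True
        time.sleep(interval)
    return False


class FlappingHost:
    """Hypothetical host whose port reports up only after a few polls."""

    def __init__(self, polls_until_up):
        self.polls_left = polls_until_up

    def is_interface_status_up(self, port):
        # Each poll consumes one "down" reading; the port is up afterwards.
        self.polls_left -= 1
        return self.polls_left < 0


# A link that needs two polls to come up recovers within the timeout.
recovering = FlappingHost(polls_until_up=2)
recovered = wait_until(1, 0.01, 0, recovering.is_interface_status_up, "Ethernet0")

# A link that never comes up makes the poller return False after the timeout.
stuck = FlappingHost(polls_until_up=10**9)
stuck_result = wait_until(0.05, 0.01, 0, stuck.is_interface_status_up, "Ethernet0")
```

The short timeouts are only there to keep the sketch fast; the PR itself uses 30s and 2s.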

Fixes #25330

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202311
  • 202405
  • 202411
  • 202505
  • 202511
  • 202512
  • 202605

Approach

What is the motivation for this PR?

test_pfcwd_basic_*_lossless_prio runs were intermittently failing on multi-DUT topologies with the assertion "Loss rate of Data Flow 1 (0.0) should be in [0.7, 1]". Inspection of the run logs showed that the PFCwd drop and storm-detected counters were identical before and after the traffic run, i.e. PFCwd never engaged because at least one peer port wasn't actually up when traffic started. The current helper has no precondition check on link state, so the failure surfaces as a confusing TGEN loss-rate mismatch instead of a clear "port not up" error.
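
The diagnosis above can be sketched as two small checks. This is an illustration only: the counter names and dictionary layout are hypothetical and do not reflect the helper's actual stats format; the [0.7, 1] range is the one quoted in the failing assertion.

```python
def pfcwd_engaged(stats_before, stats_after,
                  counters=("storm_detected", "tx_drop", "rx_drop")):
    """Return True if any watched counter increased during the traffic run.

    The counter names are hypothetical placeholders for illustration.
    """
    return any(
        stats_after.get(name, 0) > stats_before.get(name, 0)
        for name in counters
    )


def loss_rate_in_range(loss_rate, low=0.7, high=1.0):
    """Mirror the expectation that the data flow loss rate lies in [low, high]."""
    return low <= loss_rate <= high


# Counters identical before and after: PFCwd never engaged, so the data flow
# saw no loss and the loss-rate expectation fails.
before = {"storm_detected": 3, "tx_drop": 120, "rx_drop": 0}
after_idle = dict(before)
after_storm = {"storm_detected": 4, "tx_drop": 950, "rx_drop": 0}
```

With these inputs, pfcwd_engaged(before, after_idle) is False and loss_rate_in_range(0.0) is False, which matches the symptom reported in the logs.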

How did you do it?

In tests/snappi_tests/pfcwd/files/pfcwd_basic_helper.py::run_pfcwd_basic_test, immediately after start_pfcwd is invoked on both DUTs and before initial PFCwd stats are sampled, iterate over both (duthost, peer_port) pairs and assert each is operationally up.
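
A rough sketch of that precondition loop, under stated assumptions: assert_peer_ports_up, StubHost, and poll_once are hypothetical names introduced for illustration, the poller is passed in so the sketch stays self-contained, and the actual change in the PR may be structured differently (for example, by using the repository's own assertion helper).

```python
def assert_peer_ports_up(host_port_pairs, wait_until, timeout=30, interval=2):
    """Fail with a clear message if any (host, port) pair is not up.

    `wait_until` is any callable with the signature
    (timeout, interval, delay, condition, *args) returning a bool.
    """
    for host, port in host_port_pairs:
        is_up = wait_until(timeout, interval, 0,
                           host.is_interface_status_up, port)
        assert is_up, "Port {} on {} is not operationally up".format(
            port, host.hostname)


class StubHost:
    """Hypothetical DUT host stub with a fixed set of ports that are up."""

    def __init__(self, hostname, up_ports):
        self.hostname = hostname
        self.up_ports = set(up_ports)

    def is_interface_status_up(self, port):
        return port in self.up_ports


def poll_once(timeout, interval, delay, condition, *args):
    """Trivial stand-in poller that evaluates the condition a single time."""
    return bool(condition(*args))


ingress = StubHost("dut-a", up_ports={"Ethernet0"})
egress = StubHost("dut-b", up_ports=set())  # peer port is down

try:
    assert_peer_ports_up([(ingress, "Ethernet0"), (egress, "Ethernet4")],
                         poll_once)
    failure = None
except AssertionError as exc:
    failure = str(exc)
```

Running the check before the initial PFCwd stats are sampled means a down link is reported by host and port name rather than surfacing later as a loss-rate mismatch.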

How did you verify/test it?

  • Ran test_pfcwd_basic_single_lossless_prio and test_pfcwd_basic_multi_lossless_prio. Tests pass with both DUT ports up.
  • Negative path: manually shut down one of the peer ports (config interface shutdown) before launching the test; the test now fails fast (within ~30s) with Port EthernetX on <host> is not operationally up instead of running several minutes of traffic and failing with a misleading loss-rate assertion.

Any platform specific information?

None. The check uses generic SONiC CLI status (is_interface_status_up) and applies to all platforms running this snappi PFCwd basic helper. Multi-ASIC and single-ASIC paths are both covered.

Supported testbed topology if it's a new test case?

N/A

Documentation

N/A — no behavioral or interface change requiring documentation updates.

… test

Signed-off-by: Edi Wibowo <ediwibowo@microsoft.com>
@ediwibowo-msft ediwibowo-msft self-assigned this Jun 12, 2026
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld mssonicbld added the Request for 202511 branch label Jun 12, 2026
@mssonicbld

Collaborator

This PR has backport request for branch(es): 202511.
Added label(s) for branch(es) 202511.

---Powered by SONiC BuildBot

@rraghav-cisco rraghav-cisco self-requested a review June 16, 2026 21:07