[pc] Pass ignore_loganalyzer to config_reload in test_retry_count teardown#25488

Open
sakshamkhurana21 wants to merge 1 commit into sonic-net:master from sakshamkhurana21:sakkhurana/fix-retry-count-flaky

Conversation

@sakshamkhurana21

Contributor

Description of PR

Summary: Fix the flaky pc/test_retry_count.py::TestDutRetryCount::test_kill_team_peer_lag_up test.

The test verifies LAG retry count behavior: the LAG must stay up for 150s after teamd is killed.
The test body passes, so the functional behavior is correct. However, the test fails
intermittently on teardown because LogAnalyzer catches transient syslog errors that
occur during config_reload in the config_reload_on_cleanup fixture.

Transient errors during config_reload include:

  • ERR teamd#teamsyncd: Failed to initialize team handler for LAG ... Unable to initialize team socket
  • ERR memory_checker: cgroup memory usage file ... does not exist
  • ERR swss#orchagent: removeLag: Failed to remove ref count

These are expected during container restart — the system retries and recovers automatically.
The config_reload_on_cleanup fixture was not telling LogAnalyzer to expect these transient
errors, so LogAnalyzer was failing the test.
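
To make the three signatures above concrete, here is a minimal sketch that classifies a syslog line as one of the transient reload errors. The regular expressions and the helper name `is_transient_reload_error` are hypothetical and written only from the messages quoted in this description; they are not taken from LogAnalyzer's own configuration.

```python
import re

# Hypothetical patterns derived from the three signatures listed in this PR.
# They are illustrative only and are not part of sonic-mgmt.
TRANSIENT_PATTERNS = [
    re.compile(r"teamsyncd: .*Failed to initialize team handler"),
    re.compile(r"memory_checker: .*cgroup memory usage file .* does not exist"),
    re.compile(r"orchagent: .*removeLag: Failed to remove ref count"),
]


def is_transient_reload_error(line: str) -> bool:
    """Return True if an ERR syslog line matches a known transient signature."""
    if " ERR " not in line:
        return False
    return any(pattern.search(line) for pattern in TRANSIENT_PATTERNS)


if __name__ == "__main__":
    # Sample line copied from the verification log in this PR.
    sample = (
        "2026 Jun 19 07:12:30 vlab-03 ERR swss#orchagent: :- removeLag: "
        "Failed to remove ref count 3 LAG PortChannel102"
    )
    print(is_transient_reload_error(sample))
```

This is only a way to reason about which lines are in scope; the fix in this PR does not filter by pattern.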

Fixes the intermittent failure observed in Elastictest test plans including:

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202311
  • 202405
  • 202411
  • 202505
  • 202511
  • 202512
  • 202605

Approach

What is the motivation for this PR?

test_kill_team_peer_lag_up is a flaky test. Over the last 14 days it produced 71 errors across 39 distinct PRs.
The dominant error signatures are:

  • teamsyncd: Failed to initialize team handler (68 occurrences, 96%)
  • memory_checker: cgroup memory usage file does not exist (3 occurrences, 4%)

Both occur during config_reload in the teardown fixture and are transient/self-healing.

How did you do it?

Added loganalyzer as a fixture dependency to config_reload_on_cleanup and passed
ignore_loganalyzer=loganalyzer to the config_reload() call. This tells LogAnalyzer
to add start/end ignore markers around the reload operation, so transient errors during
reload are not captured.
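
The marker mechanism described above can be sketched with stand-ins that run without a testbed. Everything here is an assumption for illustration: `FakeLogAnalyzer`, `fake_config_reload`, the marker strings, and the method names `add_start_ignore_mark` / `add_end_ignore_mark` are modeled on the "start/end ignore markers" wording in this description, and the generator only mimics the shape of a pytest yield-fixture. It is not the code changed by this PR.

```python
from typing import Iterator, List, Optional

START_MARK = "start-ignore"  # hypothetical marker text
END_MARK = "end-ignore"      # hypothetical marker text

# ERR line copied from the verification log in this PR.
RELOAD_ERR = (
    "2026 Jun 19 07:12:30 vlab-03 ERR swss#orchagent: :- removeLag: "
    "Failed to remove ref count 3 LAG PortChannel102"
)


class FakeLogAnalyzer:
    """Toy stand-in for LogAnalyzer: stores syslog lines and ignore markers."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def add_start_ignore_mark(self) -> None:
        self.lines.append(START_MARK)

    def add_end_ignore_mark(self) -> None:
        self.lines.append(END_MARK)

    def log(self, line: str) -> None:
        self.lines.append(line)

    def errors(self) -> List[str]:
        """Return ERR lines that fall outside every ignore window."""
        found: List[str] = []
        ignoring = False
        for line in self.lines:
            if line == START_MARK:
                ignoring = True
            elif line == END_MARK:
                ignoring = False
            elif " ERR " in line and not ignoring:
                found.append(line)
        return found


def fake_config_reload(
    syslog: FakeLogAnalyzer,
    ignore_loganalyzer: Optional[FakeLogAnalyzer] = None,
) -> None:
    """Simulate a reload that emits a transient ERR line into syslog."""
    if ignore_loganalyzer is not None:
        ignore_loganalyzer.add_start_ignore_mark()
    syslog.log(RELOAD_ERR)
    if ignore_loganalyzer is not None:
        ignore_loganalyzer.add_end_ignore_mark()


def config_reload_on_cleanup(loganalyzer: FakeLogAnalyzer) -> Iterator[None]:
    """Generator shaped like a yield-fixture: the reload runs on teardown."""
    yield  # the test body would run here
    fake_config_reload(loganalyzer, ignore_loganalyzer=loganalyzer)


def run_teardown(loganalyzer: FakeLogAnalyzer) -> None:
    """Drive the generator through setup and teardown, as a test runner would."""
    fixture = config_reload_on_cleanup(loganalyzer)
    next(fixture)
    for _ in fixture:
        pass
```

In this sketch, calling `fake_config_reload` without `ignore_loganalyzer` leaves the ERR line visible to `errors()`, while the teardown path that passes the analyzer brackets the reload and hides it. That mirrors why the fixture needs `loganalyzer` as a dependency: without a handle to the analyzer, the reload has nothing to place markers on.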

This is the canonical pattern used by 8+ other tests in sonic-mgmt that perform config_reload:

  • tests/route/test_route_perf.py
  • tests/pc/test_lag_member_forwarding.py
  • tests/drop_packets/drop_packets.py
  • tests/wan/lacp/test_wan_lag_min_link.py
  • tests/bgp/test_bgp_suppress_fib.py
  • etc.

How did you verify/test it?

  1. Confirmed transient errors occur during config_reload on dev-VM (vlab-03, t1-lag):
=== ERR messages during config_reload ===
2026 Jun 19 07:12:30 vlab-03 ERR swss#orchagent: :- removeLag: Failed to remove ref count 3 LAG PortChannel102
2026 Jun 19 07:12:30 vlab-03 ERR swss#orchagent: :- removeLag: Failed to remove ref count 3 LAG PortChannel105
... (30+ transient ERR lines during reload)
  2. Verified the fix follows the canonical pattern used by other tests.

  3. Syntax and lint verified: py_compile and flake8 --max-line-length=120 clean.

Any platform specific information?

None

Supported testbed topology if it's a new test case?

N/A

Documentation

N/A

…rdown

Signed-off-by: sakshamkhurana <sakkhurana@microsoft.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses intermittent teardown failures in tests/pc/test_retry_count.py by muting LogAnalyzer during config_reload() inside the config_reload_on_cleanup fixture, preventing expected transient syslog ERR messages during reload from failing the test.

Changes:

  • Add loganalyzer as a dependency to the config_reload_on_cleanup fixture.
  • Pass ignore_loganalyzer=loganalyzer to config_reload(..., safe_reload=True) during fixture teardown.

@mssonicbld

Collaborator

This PR has backport request for branch(es): 202605.
Added label(s) for branch(es) 202605.

---Powered by SONiC BuildBot
