Skip to content

[tacacs]: Fix ssh_run_command hang on large Paramiko SSH output#25514

Open
Verma-Anukul wants to merge 1 commit into
sonic-net:masterfrom
Verma-Anukul:tacacs-ssh-cmd-timeout-fix
Open

[tacacs]: Fix ssh_run_command hang on large Paramiko SSH output#25514
Verma-Anukul wants to merge 1 commit into
sonic-net:masterfrom
Verma-Anukul:tacacs-ssh-cmd-timeout-fix

Conversation

@Verma-Anukul

Copy link
Copy Markdown
Contributor

Description of PR

Summary: Fix TACACS ssh_run_command() hang when Paramiko SSH commands produce large output (e.g. sudo zgrep pfcwd /var/log/syslog* on a DUT with thousands of rotated syslog files).
Enforce TIMEOUT_LIMIT as a real wall-clock timeout, drain stdout/stderr to avoid channel deadlock, and fail clearly instead of blocking indefinitely.

Fixes # #25512

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202311
  • 202405
  • 202411
  • 202505
  • 202511
  • 202512
  • 202605

Approach

What is the motivation for this PR?

TACACS tests use ssh_run_command() in tests/tacacs/utils.py to run commands on the DUT over Paramiko SSH. The helper sets TIMEOUT_LIMIT = 120 but does not enforce a wall-clock limit on command execution. It calls stdout.channel.recv_exit_status(), which blocks until the remote command exits.

Two problems combine on real testbeds:

No execution timeout — exec_command(timeout=120) only bounds initial channel setup, not command runtime (paramiko#1815).
Paramiko channel deadlock — Without draining stdout/stderr while waiting, large output (e.g. ~3.5 MB from zgrep over 2300+ files) fills the SSH channel window. The remote process blocks on write, the client blocks on recv_exit_status(), and the test hangs until Jenkins or an operator kills it.
This was seen in full regression on test_tacacs_authorization_wildcard when the DUT had 1000+ rotated syslog files.

How did you do it?

Updated ssh_run_command() in tests/tacacs/utils.py:

Call exec_command(command, timeout=timeout) for channel-level I/O timeout.
Use a single time.monotonic() deadline shared across the whole command.
Poll with wait_until() and a command_complete() callback that:
Drains stdout/stderr in while recv_ready() / while recv_stderr_ready() loops each poll (not one chunk per second).
Collects exit status when exit_status_ready(), then keeps draining until both streams are idle.
Avoid blocking stdout.read() / stderr.read() after exit status.
On timeout: channel.close() and pytest.fail() with a clear message.
Return decoded io.StringIO streams so readlines() callers (e.g. check_ssh_output_any_of()) still work.
Raise AuthenticationException from ssh_connect_remote_retry() after retries instead of returning None.

How did you verify/test it?

Repro: On mth-t0-64, created 2300 extra rotated syslog files (syslog.100.gz … syslog.2399.gz). Without the fix, pytest tacacs/test_authorization.py::test_tacacs_authorization_wildcard hung indefinitely on sudo zgrep pfcwd /var/log/syslog*.

With fix: All 13 tests in tacacs/test_authorization.py passed; the wildcard zgrep case completed in ~8s on the DUT (hang was in Paramiko I/O, not slow DUT execution).

Timeout behavior: Slow or stuck commands now fail at 120s with Command timed out after 120s: instead of hanging the job.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

* Poll channel completion with wait_until and enforce TIMEOUT_LIMIT
* Drain stdout/stderr via recv_ready loops to avoid Paramiko deadlock
* Keep exec_command(timeout=) for channel-level I/O timeout
* Return decoded StringIO streams for readlines() callers
* Raise AuthenticationException after ssh_connect_remote_retry exhaustion

Fixes sonic-net#25512

Signed-off-by: Anukul Verma <anukverm@cisco.com>
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@Verma-Anukul

Copy link
Copy Markdown
Contributor Author

@wangxin @arlakshm @roy-sror @yxieca @StormLiangMS @qiluo-msft

Please help in review and merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants