[tacacs]: Fix ssh_run_command hang on large Paramiko SSH output#25514
Open
Verma-Anukul wants to merge 1 commit into
Open
[tacacs]: Fix ssh_run_command hang on large Paramiko SSH output#25514Verma-Anukul wants to merge 1 commit into
Verma-Anukul wants to merge 1 commit into
Conversation
* Poll channel completion with wait_until and enforce TIMEOUT_LIMIT * Drain stdout/stderr via recv_ready loops to avoid Paramiko deadlock * Keep exec_command(timeout=) for channel-level I/O timeout * Return decoded StringIO streams for readlines() callers * Raise AuthenticationException after ssh_connect_remote_retry exhaustion Fixes sonic-net#25512 Signed-off-by: Anukul Verma <anukverm@cisco.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Contributor
Author
|
@wangxin @arlakshm @roy-sror @yxieca @StormLiangMS @qiluo-msft Please help in review and merge |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of PR
Summary: Fix TACACS ssh_run_command() hang when Paramiko SSH commands produce large output (e.g. sudo zgrep pfcwd /var/log/syslog* on a DUT with thousands of rotated syslog files).
Enforce TIMEOUT_LIMIT as a real wall-clock timeout, drain stdout/stderr to avoid channel deadlock, and fail clearly instead of blocking indefinitely.
Fixes # #25512
Type of change
Back port request
Approach
What is the motivation for this PR?
TACACS tests use ssh_run_command() in tests/tacacs/utils.py to run commands on the DUT over Paramiko SSH. The helper sets TIMEOUT_LIMIT = 120 but does not enforce a wall-clock limit on command execution. It calls stdout.channel.recv_exit_status(), which blocks until the remote command exits.
Two problems combine on real testbeds:
No execution timeout — exec_command(timeout=120) only bounds initial channel setup, not command runtime (paramiko#1815).
Paramiko channel deadlock — Without draining stdout/stderr while waiting, large output (e.g. ~3.5 MB from zgrep over 2300+ files) fills the SSH channel window. The remote process blocks on write, the client blocks on recv_exit_status(), and the test hangs until Jenkins or an operator kills it.
This was seen in full regression on test_tacacs_authorization_wildcard when the DUT had 1000+ rotated syslog files.
How did you do it?
Updated ssh_run_command() in tests/tacacs/utils.py:
Call exec_command(command, timeout=timeout) for channel-level I/O timeout.
Use a single time.monotonic() deadline shared across the whole command.
Poll with wait_until() and a command_complete() callback that:
Drains stdout/stderr in while recv_ready() / while recv_stderr_ready() loops each poll (not one chunk per second).
Collects exit status when exit_status_ready(), then keeps draining until both streams are idle.
Avoid blocking stdout.read() / stderr.read() after exit status.
On timeout: channel.close() and pytest.fail() with a clear message.
Return decoded io.StringIO streams so readlines() callers (e.g. check_ssh_output_any_of()) still work.
Raise AuthenticationException from ssh_connect_remote_retry() after retries instead of returning None.
How did you verify/test it?
Repro: On mth-t0-64, created 2300 extra rotated syslog files (syslog.100.gz … syslog.2399.gz). Without the fix, pytest tacacs/test_authorization.py::test_tacacs_authorization_wildcard hung indefinitely on sudo zgrep pfcwd /var/log/syslog*.
With fix: All 13 tests in tacacs/test_authorization.py passed; the wildcard zgrep case completed in ~8s on the DUT (hang was in Paramiko I/O, not slow DUT execution).
Timeout behavior: Slow or stuck commands now fail at 120s with Command timed out after 120s: instead of hanging the job.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation