Fix flaky action server and client tests under CPU-starved CI#50
Open
vik748 wants to merge 1 commit intolocusrobotics:masterfrom
Open
Fix flaky action server and client tests under CPU-starved CI#50vik748 wants to merge 1 commit intolocusrobotics:masterfrom
vik748 wants to merge 1 commit intolocusrobotics:masterfrom
Conversation
The test helpers in test_action_server.py (wait_for_status, wait_for_result) and the reach_status/wait/feedback calls in test_action_client.py all poll or block indefinitely. Under CPU/IO contention on shared CI runners, the ROS spinner thread is starved and cross-thread callbacks (call_soon_threadsafe, run_coroutine_threadsafe) are delayed, causing tests to hang for the full 60s rostest timeout. Add bounded 5s timeouts via asyncio.wait_for to all unbounded waits, and a retry_on_timeout helper that resends goals up to 3 times — mirroring the pattern already used by AsyncActionClient.ensure_goal in production code. Also replace deprecated assertEquals with assertEqual.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix flaky action server and client tests under CPU-starved CI
Problem
test_action_serverandtest_action_clientintermittently hang for the full 60srostesttimeout in CI before being killed withKeyboardInterrupt. This produces misleading errors:During teardown, Python's interpreter cleanup clears builtins before pending asyncio tasks finish logging, producing a secondary
NameError: name 'open' is not defined— a red herring that obscures the real issue.Root cause
Test helpers poll indefinitely for goal status transitions. In
test_action_server.py:In
test_action_client.py, the unbounded calls are togoal_handle.reach_status(),goal_handle.wait(), and thegoal_handle.feedback()async generator — all of which block indefinitely if the expected transition never arrives.Under CPU/IO contention (shared CI runners, parallel builds), the ROS spinner thread is starved and cross-thread callbacks (
loop.call_soon_threadsafeinAsyncActionServer.start(),asyncio.run_coroutine_threadsafein_AsyncGoalHandle._transition_cb) are delayed. The polling loops / condition waits never see the status transition and block untilrostestkills the process.Reproduction
We reproduced the flaky behavior locally using
stress-ngto simulate CI resource contention:Before fix (no timeouts, no retries):
test_goal_succeededhung on run 4 out of 10, blocking for 58s beforerostestkilled it. The rostest log showed all ROS topics were registered but the goal callback never fired on the asyncio event loop — confirming spinner thread starvation as the cause.After adding 5s timeouts only (no retries): Failed on run 24 out of 100 with a clean
asyncio.TimeoutError— fast failure instead of a 60s hang, but still flaky.After adding 5s timeouts + 3 retries to
test_action_server.pyonly: Passed 21 consecutive runs under the same stress load. Run 22 failed, but intest_action_client.pywhich had the same unbounded pattern — confirming both files needed the fix.Fix
Changes to
test_action_server.pyandtest_action_client.py— no library code changes.1. Bounded timeouts
Wrap all unbounded waits with
asyncio.wait_for(..., timeout=5.0). Normal test runtime is ~0.1–0.5s per test, so 5s is 10–50x headroom. A stuck test now raisesTimeoutErrorin 5s instead of hanging for 60s.In
test_action_server.py, the polling helpers are wrapped:In
test_action_client.py,reach_status(),wait(), andfeedback()calls are wrapped withasyncio.wait_for.2. Retry on timeout
Both test classes get a
retry_on_timeouthelper that resends the goal up to 3 times onTimeoutError. This mirrors the retry pattern already used byAsyncActionClient.ensure_goalin the library's own production code. Under transient CPU starvation a retry succeeds once the scheduler recovers; a real bug still fails after all retries are exhausted.Bonus
Replaced deprecated
assertEqualscalls withassertEqualintest_action_server.py.Testing
catkin run_tests aiorospy(no stress)stress-ng(before fix)stress-ng(timeout only, server tests)TimeoutError)stress-ng(timeout + retry, server tests only)test_action_client.pystress-ng(timeout + retry, both files)