
Fix flaky action server and client tests under CPU-starved CI #50

Open
vik748 wants to merge 1 commit into locusrobotics:master from rios-ai:fix/flaky-action-server-tests

Conversation


vik748 commented Mar 2, 2026

Fix flaky action server and client tests under CPU-starved CI

Problem

test_action_server and test_action_client intermittently hang for the full 60s rostest timeout in CI before being killed with KeyboardInterrupt. This produces misleading errors:

ERROR: max time [60.0s] allotted for test [test_action_server]

During teardown, Python's interpreter cleanup clears builtins before pending asyncio tasks finish logging, producing a secondary NameError: name 'open' is not defined — a red herring that obscures the real issue.

Root cause

Test helpers poll indefinitely for goal status transitions. In test_action_server.py:

async def wait_for_status(self, goal_handle, status):
    while goal_handle.get_goal_status() != status:
        await asyncio.sleep(0.1)

In test_action_client.py, the unbounded calls are to goal_handle.reach_status(), goal_handle.wait(), and the goal_handle.feedback() async generator — all of which block indefinitely if the expected transition never arrives.

Under CPU/IO contention (shared CI runners, parallel builds), the ROS spinner thread is starved and cross-thread callbacks (loop.call_soon_threadsafe in AsyncActionServer.start(), asyncio.run_coroutine_threadsafe in _AsyncGoalHandle._transition_cb) are delayed. The polling loops / condition waits never see the status transition and block until rostest kills the process.
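The failure mode can be shown with a minimal, self-contained sketch. `StarvedGoalHandle` below is a toy stand-in (not part of aiorospy) whose status transition never arrives, as happens when the spinner thread is starved; the same polling helper then spins forever unless bounded:

```python
import asyncio

class StarvedGoalHandle:
    """Toy goal handle: the expected transition never arrives,
    simulating delayed cross-thread callbacks under contention."""
    def get_goal_status(self):
        return "PENDING"

async def wait_for_status(goal_handle, status):
    # Same unbounded polling pattern as the original test helper.
    while goal_handle.get_goal_status() != status:
        await asyncio.sleep(0.1)

async def main():
    handle = StarvedGoalHandle()
    try:
        # Without the bound this coroutine would spin until rostest
        # kills the process; wait_for turns the hang into a fast,
        # explicit failure. (0.5s here; the real tests use 5.0s.)
        await asyncio.wait_for(wait_for_status(handle, "SUCCEEDED"),
                               timeout=0.5)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(main()))  # -> timed out
```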

Reproduction

We reproduced the flaky behavior locally using stress-ng to simulate CI resource contention:

# Build the package
catkin build aiorospy

# Start heavy CPU/IO/VM stress in background
stress-ng --cpu 20 --io 10 --vm 4 --vm-bytes 256M --timeout 3600s --quiet &

# Run tests in a loop, clearing cached results each time
for i in $(seq 1 100); do
    rm -rf build/aiorospy/test_results
    catkin run_tests aiorospy --no-status --no-notify
done

Before fix (no timeouts, no retries): test_goal_succeeded hung on run 4 out of 10, blocking for 58s before rostest killed it. The rostest log showed all ROS topics were registered but the goal callback never fired on the asyncio event loop — confirming spinner thread starvation as the cause.

After adding 5s timeouts only (no retries): Failed on run 24 out of 100 with a clean asyncio.TimeoutError — fast failure instead of a 60s hang, but still flaky.

After adding 5s timeouts + 3 retries to test_action_server.py only: Passed 21 consecutive runs under the same stress load. Run 22 failed, but in test_action_client.py which had the same unbounded pattern — confirming both files needed the fix.

Fix

Changes to test_action_server.py and test_action_client.py — no library code changes.

1. Bounded timeouts

Wrap all unbounded waits with asyncio.wait_for(..., timeout=5.0). Normal test runtime is ~0.1–0.5s per test, so 5s is 10–50x headroom. A stuck test now raises TimeoutError in 5s instead of hanging for 60s.

In test_action_server.py, the polling helpers are wrapped:

WAIT_TIMEOUT = 5.0

async def wait_for_status(self, goal_handle, status):
    async def _poll():
        while goal_handle.get_goal_status() != status:
            await asyncio.sleep(0.1)
    await asyncio.wait_for(_poll(), timeout=WAIT_TIMEOUT)

In test_action_client.py, reach_status(), wait(), and feedback() calls are wrapped with asyncio.wait_for.
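Awaitables like `reach_status()` and `wait()` can be passed to `asyncio.wait_for` directly, but `feedback()` is an async generator, which `wait_for` cannot wrap as a whole; each item fetch has to be bounded individually. A sketch of that pattern, using a toy stand-in generator rather than aiorospy's real `feedback()`:

```python
import asyncio

WAIT_TIMEOUT = 0.5  # shortened for the demo; the tests use 5.0

async def bounded_aiter(agen, timeout):
    """Yield items from an async generator, bounding each fetch.

    asyncio.wait_for cannot wrap an async generator directly, so
    each __anext__() call is bounded on its own.
    """
    it = agen.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(it.__anext__(), timeout)
        except StopAsyncIteration:
            return
        yield item

# Toy stand-in for goal_handle.feedback(): two items, then silence,
# simulating a starved spinner that stops delivering feedback.
async def feedback():
    yield "fb1"
    yield "fb2"
    await asyncio.sleep(10)
    yield "never"

async def main():
    received = []
    try:
        async for fb in bounded_aiter(feedback(), WAIT_TIMEOUT):
            received.append(fb)
    except asyncio.TimeoutError:
        received.append("timed out")
    return received

print(asyncio.run(main()))  # -> ['fb1', 'fb2', 'timed out']
```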

2. Retry on timeout

Both test classes get a retry_on_timeout helper that resends the goal up to 3 times on TimeoutError. This mirrors the retry pattern already used by AsyncActionClient.ensure_goal in the library's own production code. Under transient CPU starvation a retry succeeds once the scheduler recovers; a real bug still fails after all retries are exhausted.

MAX_RETRIES = 3

async def retry_on_timeout(self, fn):
    for attempt in range(MAX_RETRIES):
        try:
            return await fn()
        except asyncio.TimeoutError:
            if attempt == MAX_RETRIES - 1:
                raise
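Note that `fn` must be a zero-argument callable that builds a fresh coroutine each call, so every retry actually resends the goal rather than re-awaiting a dead coroutine. A runnable usage sketch with a toy "send goal and wait" that times out twice before the scheduler recovers (hypothetical names, shortened timeout for the demo):

```python
import asyncio

MAX_RETRIES = 3
WAIT_TIMEOUT = 0.2  # shortened for the demo; the tests use 5.0

async def retry_on_timeout(fn):
    # Same helper as above: rebuild and re-await the coroutine on
    # each TimeoutError, re-raising only on the final attempt.
    for attempt in range(MAX_RETRIES):
        try:
            return await fn()
        except asyncio.TimeoutError:
            if attempt == MAX_RETRIES - 1:
                raise

attempts = []

async def send_goal_and_wait():
    """Toy goal round-trip: starved twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        # Simulate a transition that never arrives in time.
        await asyncio.wait_for(asyncio.sleep(10), timeout=WAIT_TIMEOUT)
    return "SUCCEEDED"

result = asyncio.run(retry_on_timeout(send_goal_and_wait))
print(result, len(attempts))  # -> SUCCEEDED 3
```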

Bonus

Replaced deprecated assertEquals calls with assertEqual in test_action_server.py.

Testing

| Scenario | Result |
| --- | --- |
| catkin run_tests aiorospy (no stress) | 21 tests, 0 errors, 0 failures |
| 100 iterations under stress-ng (before fix) | Failed on run 4 (60s hang) |
| 100 iterations under stress-ng (timeout only, server tests) | Failed on run 24 (5s TimeoutError) |
| 100 iterations under stress-ng (timeout + retry, server tests only) | 21 passes; run 22 failed in unmodified test_action_client.py |
| 100 iterations under stress-ng (timeout + retry, both files) | 100/100 pass |

The test helpers in test_action_server.py (wait_for_status,
wait_for_result) and the reach_status/wait/feedback calls in
test_action_client.py all poll or block indefinitely. Under
CPU/IO contention on shared CI runners, the ROS spinner thread
is starved and cross-thread callbacks (call_soon_threadsafe,
run_coroutine_threadsafe) are delayed, causing tests to hang
for the full 60s rostest timeout.

Add bounded 5s timeouts via asyncio.wait_for to all unbounded
waits, and a retry_on_timeout helper that resends goals up to
3 times — mirroring the pattern already used by
AsyncActionClient.ensure_goal in production code.

Also replace deprecated assertEquals with assertEqual.
