Nigthly Fix for Slow Tests#72549
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for parsing a --timeout argument from test script arguments and applying it to the subprocess wait call. It also updates the subprocess wrapper to terminate the process if it times out during wait(), and increases the termination timeout constant. Feedback on these changes highlights two main issues: first, in tasks.py, calling self.terminate() is asynchronous and may result in a None return code, which can cause a crash in the caller; a join with a fallback return value should be used instead. Second, in run_python_test.py, parsing --timeout directly from the argument list lacks bounds and type checking, which can lead to IndexError or ValueError and should be made more robust.
|
PR #72549: Size comparison from cf7110d to dc2cc97 Full report (25 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32)
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #72549 +/- ##
==========================================
+ Coverage 55.85% 56.76% +0.90%
==========================================
Files 1638 1634 -4
Lines 112255 112660 +405
Branches 13400 13143 -257
==========================================
+ Hits 62702 63948 +1246
+ Misses 49553 48712 -841 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
PR #72549: Size comparison from cf7110d to a9d85e7 Full report (25 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32)
|
FrancoLionti
left a comment
There was a problem hiding this comment.
I have those three comments so far not but not too big a deal actually.
| split_args = shlex.split(script_args) | ||
| timeout_key = '--timeout' | ||
| test_arg_timeout = None | ||
| timeout_index = split_args.index(timeout_key) if timeout_key in split_args else None |
There was a problem hiding this comment.
The manual --timeout lookup in run_python_test.py only handles the space-separated form (--timeout 20). It looks for a standalone --timeout token and reads the next one. With --timeout=20, shlex.split() produces a single token, so the lookup misses it, test_arg_timeout stays None, and the outer Subprocess.wait() falls back to the 300s default — silently undoing the fix this PR is trying to make.
No current caller uses the = form (CI metadata uses --timeout 1800, etc.), so this is latent rather than a live bug.
runner.py already parses --timeout via argparse (type=int), which handles both forms and validates the value. Worth reusing that here (e.g. a small parse_known_args with just --timeout) instead of hand-parsing, so the orchestrator and the test script read the flag the same way.
There was a problem hiding this comment.
I far I know we don't use that format for CI arguments in the CI TEST ARGUMENTS section from python tests.
There was a problem hiding this comment.
By the way only a few tests use "=" for arguments and for what I see only for app arguments and not for script arguments:
You can check by using sed on src/python_testing
sed -n '/BEGIN CI TEST ARGUMENTS/,/END CI TEST ARGUMENTS/p' src/python_testing/*py | grep -v "===" | grep =
for more information, see https://pre-commit.ci
|
PR #72549: Size comparison from cf7110d to e18b1f2 Full report (25 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32)
|
Summary
Currently Nightly Jobs are failing because the tests are timing out but are reported as completed with correct status. This PR fix this issue:
Logs reference: https://github.com/project-chip/connectedhomeip/actions/runs/26790493182/job/78977530766
This PR fixes how to handle the subprocess wait and timeout and terminate the test and report the status code to
run_python_tests.py.Also currently the --timeout is the test timeout or the timeout the author considers is enough for the test to complete and we should let the test report this timeout. In the case for some reason this test is stuck and the framework does not respond in the test timeout the run_python_test should terminate it and avoid let zoombies process, that is the reason a +60 seconds is added on the subprocess kill over the limit provided by the --timeout in the test.
run_python_test global timeout: --timeout seconds + 60 seconds
test_timeout: --timeout -> Test can reach this timeout and fail or if is stuck global timeout will terminate the pid.
Related issues
Fixes: project-chip/matter-test-scripts#787
Testing
There are two ways to test this:
Locally:
Open a slow test like TC_SU tests which contains the --timeout argument with a timeout larger than 3 minutes.
Update the timeout with a time under 3 minutes or for quick results a --timeout of 20 seconds.
Test example:
Update the --timeout
Execute the test with
run_python_test.pypython3 scripts/tests/run_python_test.py --load-from-env /tmp/test_env.yaml --script src/python_testing/TC_SU_2_5.pyIn a second tab open the test logs.
tail -F -n 3000 /tmp/matter_testing/logs/MatterTest/latest/test_log.INFOVerify the logs from
run_python_tests.pyreturn a status code after the 20 seconds.Verify in the test log the test is terminated properly.
CI Jobs:
Uncomment the test section from Nightly jobs and comment the section from schedule:
Verfiy Jobs are completed properly and terminated properly not as see in the first screenshot from this PR.
Checklist