CI: retry collect_ci_stats and report API errors clearly #6311
Merged
Wrap the collect-stats invocation in retry.sh so transient GitHub API / network failures are retried instead of failing the job on the first error. In collect_ci_stats.py, call resp.raise_for_status() on the jobs request so a non-200 response surfaces as a clear HTTP error (and a non-zero exit that retry.sh can retry) instead of an opaque KeyError: 'jobs' coming from resp.json()['jobs']. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
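A minimal offline sketch of the difference this commit describes, assuming the jobs endpoint returns a JSON error body without a jobs key on failure. The helper make_response, the two jobs_* functions, and the placeholder URL are hypothetical names for illustration, not code from collect_ci_stats.py.

```python
import requests


def make_response(status_code, reason, body):
    """Build a requests.Response by hand so the demo needs no network (hypothetical helper)."""
    resp = requests.Response()
    resp.status_code = status_code
    resp.reason = reason
    resp.url = "https://example.invalid/jobs"  # placeholder, not the real endpoint
    resp._content = body  # private attribute, set here only to stay offline
    return resp


def jobs_unchecked(resp):
    # Behaviour before the change: index the body without looking at the status.
    return resp.json()["jobs"]


def jobs_checked(resp):
    # Behaviour after the change: a 4xx/5xx status raises requests.exceptions.HTTPError
    # before the body is indexed.
    resp.raise_for_status()
    return resp.json()["jobs"]


if __name__ == "__main__":
    error = make_response(403, "Forbidden", b'{"message": "API rate limit exceeded"}')
    for fetch in (jobs_unchecked, jobs_checked):
        try:
            fetch(error)
        except Exception as exc:  # broad on purpose: the demo prints whichever error occurs
            print(f"{fetch.__name__}: {type(exc).__name__}: {exc}")
```

With the same 403 response, jobs_unchecked raises KeyError while jobs_checked raises HTTPError carrying the status code and URL. Note that raise_for_status() only raises for 4xx and 5xx statuses, so other non-200 responses would still need separate handling.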
oitel
requested changes
Jun 24, 2026
Comment on lines +103 to +104
# Report a clear HTTP error (rate limit, auth, transient 5xx) and exit non-zero
# instead of a later KeyError on resp.json()['jobs']; this also lets retry.sh retry.
      GIT_COMMIT: ${{ github.event.pull_request.head.sha || github.sha }}
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-     run: python3 scripts/devops/collect_ci_stats.py
+     run: bash scripts/retry.sh -- python3 scripts/devops/collect_ci_stats.py
Contributor
Add the retry inside the Python script to fail early on genuine HTTP errors (auth failure, server errors).
Move the retry out of the workflow (retry.sh wrapper) and into fetch_jobs: retry only transient network errors, while raise_for_status() stays outside the retry loop so genuine HTTP errors (auth, server) fail fast. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
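A rough sketch of the shape this commit describes, not the actual fetch_jobs from the PR. It assumes "3 times" means three attempts in total, and the get and sleep parameters are hypothetical injection points added so the loop can be exercised without a network or real waits.

```python
import time

import requests

MAX_ATTEMPTS = 3       # assumption: three attempts in total
COOLDOWN_SECONDS = 30  # cooldown between attempts, per the PR description


def fetch_jobs(url, headers=None, get=requests.get, sleep=time.sleep):
    """Return the 'jobs' list, retrying only transport-level failures."""
    resp = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = get(url, headers=headers, timeout=30)
            break  # a response arrived, whatever its status code
        except requests.exceptions.RequestException as exc:
            if attempt == MAX_ATTEMPTS:
                raise  # out of attempts: let the network error propagate
            print(f"attempt {attempt} failed: {exc}; retrying in {COOLDOWN_SECONDS}s")
            sleep(COOLDOWN_SECONDS)

    # Deliberately outside the try block: HTTPError is a subclass of
    # RequestException, so raising it inside the loop would trigger a retry.
    resp.raise_for_status()
    return resp.json()["jobs"]
```

Keeping raise_for_status() after the loop is what separates the two failure classes: connection errors and timeouts get another attempt, while an error status from the API ends the call on the first response.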
oitel
approved these changes
Jun 24, 2026
Problem
The collect-stats job fails intermittently with KeyError: 'jobs'. fetch_jobs() indexes resp.json()['jobs'] without checking the HTTP status. When the GitHub jobs API returns a non-200 response (auth failure, server error), the body has no jobs key, so the real cause is masked by an opaque KeyError. A transient network hiccup, meanwhile, fails the job on the first attempt.
Fix
fetch_jobs() now handles failures explicitly. The requests.get call is wrapped to catch requests.exceptions.RequestException, retrying 3 times with a 30s cooldown before giving up. resp.raise_for_status() sits outside the retry loop, so an error response is reported immediately with a clear message instead of being blindly retried or surfacing later as KeyError: 'jobs'.
The job is best-effort (continue-on-error: true), so this does not change merge gating; it just makes the telemetry resilient to flaky network calls and gives a readable error when the API genuinely fails.
CI scope
Touches only the stats-collection script; no platform build output is affected. Disabled all build platforms except ubuntu-x64, which is kept enabled so the collect-stats job has real jobs/artifacts to exercise the change against.