CI: retry collect_ci_stats and report API errors clearly#6311

Merged
Fedr merged 2 commits into master from ci/collect-stats-retry on Jun 24, 2026

Fedr commented Jun 23, 2026

Problem

The collect-stats job fails intermittently with:

Traceback (most recent call last):
  File "scripts/devops/collect_ci_stats.py", line 136, in <module>
    'jobs':        parse_jobs(resp.json()['jobs']),
KeyError: 'jobs'

fetch_jobs() indexes resp.json()['jobs'] without checking the HTTP status. When the GitHub jobs API returns a non-200 response (auth failure, server error, transient network hiccup), the body has no jobs key, so the real cause is masked by an opaque KeyError.
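To make the masking concrete, here is a small sketch, not the script's code. The error payload is hypothetical (modelled on the JSON error bodies GitHub returns), and `jobs_unchecked`, `jobs_checked` and `describe` are illustrative names that do not appear in collect_ci_stats.py.

```python
import json

# Hypothetical error response, modelled on GitHub's JSON error bodies.
ERROR_STATUS = 401
ERROR_BODY = json.loads('{"message": "Bad credentials"}')


def jobs_unchecked(status, body):
    """Old behaviour: index 'jobs' without looking at the HTTP status."""
    return body['jobs']


def jobs_checked(status, body):
    """Check the status first so the failure names its real cause."""
    if status != 200:
        raise RuntimeError(f"jobs API returned HTTP {status}: {body.get('message')}")
    return body['jobs']


def describe(func):
    """Call func on the error response and return the raised exception as text."""
    try:
        func(ERROR_STATUS, ERROR_BODY)
    except Exception as err:
        return f"{type(err).__name__}: {err}"
    return "no error"


print(describe(jobs_unchecked))  # KeyError: 'jobs'
print(describe(jobs_checked))    # RuntimeError: jobs API returned HTTP 401: Bad credentials
```

The unchecked path reports only the missing key, while the checked path carries the status code and the API's own message.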

Fix

fetch_jobs() now handles failures explicitly:

  • Transient network errors (connection reset, timeout, DNS) are retried — a loop around requests.get catches requests.exceptions.RequestException, retrying 3 times with a 30s cooldown before giving up.
  • Genuine HTTP errors (auth failure, server errors) fail fast — resp.raise_for_status() sits outside the retry loop, so an error response is reported immediately with a clear message instead of being blindly retried or surfacing later as KeyError: 'jobs'.
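The two bullets above can be sketched roughly as follows. This is a sketch of the described control flow under stated assumptions, not the merged code. The `get` callable and the `transient_errors` exception type are injected here so the snippet runs without network access; in the script they would presumably be requests.get and requests.exceptions.RequestException. Reading "3 times" as three attempts in total is also an assumption.

```python
import time


def fetch_jobs(get, url, transient_errors, attempts=3, cooldown=30.0, sleep=time.sleep):
    """Fetch the jobs list, retrying only transient transport failures.

    get              -- callable returning a response object (assumed: requests.get)
    transient_errors -- exception type(s) to retry on
                        (assumed: requests.exceptions.RequestException)
    attempts         -- total number of tries (assumption: "3 times" means 3 attempts)
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            resp = get(url)
            break  # a response arrived, whatever its status code
        except transient_errors:
            if attempt == attempts:
                raise  # out of attempts: surface the network error itself
            sleep(cooldown)

    # Outside the retry loop: an HTTP error status fails immediately
    # and is never retried.
    resp.raise_for_status()
    return resp.json()['jobs']
```

With the real library the call would look something like `fetch_jobs(requests.get, url, requests.exceptions.RequestException)`. Keeping raise_for_status() outside the try block matters because the HTTPError it raises is itself a RequestException, so placing it inside would cause error responses to be retried as well.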

The job is best-effort (continue-on-error: true), so this does not change merge gating; it just makes the telemetry resilient to flaky network calls and gives a readable error when the API genuinely fails.

CI scope

Touches only the stats-collection script — no platform build output is affected. Disabled all build platforms except ubuntu-x64, which is kept enabled so the collect-stats job has real jobs/artifacts to exercise the change against.

Wrap the collect-stats invocation in retry.sh so transient GitHub API /
network failures are retried instead of failing the job on the first error.

In collect_ci_stats.py, call resp.raise_for_status() on the jobs request so a
non-200 response surfaces as a clear HTTP error (and a non-zero exit that
retry.sh can retry) instead of an opaque KeyError: 'jobs' coming from
resp.json()['jobs'].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread scripts/devops/collect_ci_stats.py Outdated
Comment on lines +103 to +104
# Report a clear HTTP error (rate limit, auth, transient 5xx) and exit non-zero
# instead of a later KeyError on resp.json()['jobs']; this also lets retry.sh retry.

Drop the comment.

  GIT_COMMIT: ${{ github.event.pull_request.head.sha || github.sha }}
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- run: python3 scripts/devops/collect_ci_stats.py
+ run: bash scripts/retry.sh -- python3 scripts/devops/collect_ci_stats.py

Add the retry inside the Python script instead, so that genuine HTTP errors (auth failure, server errors) fail early rather than being retried.

Move the retry out of the workflow (retry.sh wrapper) and into fetch_jobs:
retry only transient network errors, while raise_for_status() stays outside the
retry loop so genuine HTTP errors (auth, server) fail fast.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fedr merged commit 3e06fa7 into master Jun 24, 2026
26 checks passed
Fedr deleted the ci/collect-stats-retry branch June 24, 2026 09:49