Skip to content

fix: add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429#4712

Open
JeffreyDallas wants to merge 12 commits into
mainfrom
fix/github-api-rate-limit-retry
Open

fix: add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429#4712
JeffreyDallas wants to merge 12 commits into
mainfrom
fix/github-api-rate-limit-retry

Conversation

@JeffreyDallas

@JeffreyDallas JeffreyDallas commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Description

This pull request changes the following:

  • Adds GitHubApiClient with GITHUB_TOKEN auth and automatic retry on 403/429 responses (fixes GitHub API rate-limit failures in CI)
  • Increases NodesStarted event wait timeout from 10 → 30 minutes (env-var overridable via NODES_STARTED_EVENT_TIMEOUT_MINUTES) to fix relay timeout in one-shot deploy
  • Increases MirrorNodeDeployed event wait timeout from 10 → 10 minutes (env-var overridable via MIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES)
  • Increases mirror pinger pod readiness timeout from 15 → 30 minutes (MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS 450 → 900 × 2 s) to fix Windows/WSL2 slow-runner failures
  • Increases node-add-local and separate-node-add test timeouts to fix intermittent CI timeout failures (e2e add local test intermittently failure due to time out #4715)
  • Fixes one-shot-local-build example: skip helm dependency build when chart tarballs are already cached, avoiding Docker Hub unauthenticated 429 rate-limit errors (one shot local built test failed intermittently #4721)

Related Issues

Pull request (PR) checklist

  • This PR added tests (unit, integration, and/or end-to-end)
  • This PR updated documentation
  • This PR added no TODOs or commented out code
  • This PR has no breaking changes
  • Any technical debt has been documented as a separate issue and linked to this PR
  • Any package.json changes have been explained to and approved by a repository manager
  • All related issues have been linked to this PR
  • All changes in this PR are included in the description
  • When this PR merges the commits will be squashed and the title will be used as the commit message, the 'commit message guidelines' below have been followed

Testing

  • This PR added unit tests
  • This PR added integration/end-to-end tests
  • These changes required manual testing that is documented below
  • Anything not tested is documented

The following manual testing was done:

  • Unit tests for GitHubApiClient (9 tests covering auth header injection, 403/429 retry with backoff, 404 passthrough, non-auth requests)
  • CI run on fix/github-api-rate-limit-retry branch

The following was not tested:

  • Windows pinger 30-minute timeout (relied on CI)
  • one-shot-local-build helm cache skip (cache-miss path still contacts Docker Hub on first run)

Intermittent HTTP 403 errors from the GitHub API on Windows runners were
caused by two compounding issues:

1. EdgeVersionFetcher never included an Authorization header, leaving all
   five component-version lookups unauthenticated (60 req/hour shared per
   IP on GitHub-hosted runners).
2. No retry logic existed for transient 403/429 rate-limit responses, so
   a single bad response permanently failed the dependency-check step.

Introduce GitHubApiClient, a static utility class that:
- Adds a Bearer token from GITHUB_TOKEN when present (raises limit to
  5 000 req/hour for authenticated requests).
- Retries on HTTP 403 and 429 with exponential backoff (up to 3 attempts),
  honouring the Retry-After and X-RateLimit-Reset response headers.

Wire GitHubApiClient.get() into EdgeVersionFetcher and all four dependency
managers (crane, gvproxy, podman, vfkit), replacing ~15 lines of duplicated
header-building + fetch code in each.

Add 9 unit tests for GitHubApiClient covering auth, retry, Retry-After
header parsing, exhaustion after max retries, no-retry on non-rate-limit
errors, and network-failure wrapping.

Fixes #4711

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas requested a review from a team as a code owner June 17, 2026 18:23
@trunk-io

trunk-io Bot commented Jun 17, 2026

Copy link
Copy Markdown

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@JeffreyDallas JeffreyDallas self-assigned this Jun 17, 2026
@JeffreyDallas JeffreyDallas added P1-💎 Current Milestone & Goals PR: Needs Team Approval A pull request that needs review from a team member. labels Jun 17, 2026
@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Unit Test Results - Linux

38 tests  +2   38 ✅ +2   0s ⏱️ ±0s
17 suites ±0    0 💤 ±0 
 1 files   ±0    0 ❌ ±0 

Results for commit 8359896. ± Comparison against base commit de23795.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Unit Test Results - Windows

    1 files  ±0    334 suites  +2   9s ⏱️ ±0s
1 049 tests +9  1 049 ✅ +9  0 💤 ±0  0 ❌ ±0 
1 053 runs  +9  1 053 ✅ +9  0 💤 ±0  0 ❌ ±0 

Results for commit 8359896. ± Comparison against base commit de23795.

♻️ This comment has been updated with latest results.

JeffreyDallas and others added 2 commits June 17, 2026 14:09
On Windows runners using WSL2, the mirror-node pinger pod takes longer
to become ready than the default 300×2 s = 10 minutes because it must:
1. pass its own startup probe (/tmp/alive file),
2. connect to the Mirror REST API (http://mirror-1-restjava:80),
3. submit a transaction through the consensus network, and
4. verify that transaction was ingested by the mirror importer.

All of these steps are slower under WSL2 due to network-layer indirection,
and the generic PODS_READY_MAX_ATTEMPTS budget (shared with every other
pod check) is often exhausted just as the pinger is about to go Ready.

Add MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS (default 450) and
MIRROR_NODE_PINGER_PODS_READY_DELAY (default 2 000 ms), giving pinger
checks a 15-minute budget — consistent with relay and block-node —
while leaving every other mirror-node pod check at the existing 10-minute
limit.  Both constants are overridable via environment variables.

Observed failure: https://github.com/hiero-ledger/solo/actions/runs/27705538841/job/81956787144?pr=4703

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
The relay's withWaitCondition for NodesStarted was set to 10 minutes,
but the node start sequence emits NodesStarted only after the full chain
completes (including waitForTss), which can take 15-25+ minutes on slow
or busy runners. Both timeouts are now env-var overridable constants:
NODES_STARTED_EVENT_TIMEOUT_MINUTES (default 30) and
MIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES (default 10).

Fixes #4714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

E2E Test Report

 10 files  ±0   94 suites  ±0   1h 25m 8s ⏱️ + 1m 50s
301 tests ±0  301 ✅ ±0  0 💤 ±0  0 ❌ ±0 
320 runs  ±0  320 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 242d1cc. ± Comparison against base commit de23795.

♻️ This comment has been updated with latest results.

JeffreyDallas and others added 3 commits June 17, 2026 15:21
The describe block in separate-node-add.test.ts had a 3-minute default
timeout that matched the intermittent failure reported in #4715 (Mocha
applies the describe-level timeout when the describe callback is async).
The "should add a new node to the network successfully" test runs three
sequential prepare/submit/execute commands that can take 10-15+ minutes
on slow CI runners, but only had a 12-minute timeout.

- describe block default: 3 min → 20 min
- "should add a new node" test: 12 min → 20 min
- outer describe in node-add-local: 3 min → 30 min

Fixes #4715

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Windows/WSL2 runners need more time for the mirror pinger pod to become
ready during concurrent one-shot deploys. Previous 15-min limit (450
attempts × 2000ms) was insufficient; bumped to 900 attempts = 30 min.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… 429

When helm dependency build runs for OCI chart repos (registry-1.docker.io/
bitnamicharts), it contacts Docker Hub for manifest verification even when
the tarball already exists in charts/, triggering unauthenticated rate-limit
429 errors. Fix: only run helm dependency build on cache miss; on cache hit,
restore tarballs and skip the build entirely. The cache key is the version.ts
hash, so the same key guarantees identical chart versions and tarballs.

Fixes #4721

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas requested a review from a team as a code owner June 18, 2026 02:32
Comment thread src/core/github-api-client.ts Outdated
import {SoloErrors} from './errors/solo-errors.js';

const RETRY_MAX_ATTEMPTS: number = 3;
const RETRY_BASE_DELAY_MS: number = 1000;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Duration with to toMilis(), also consider better placements for those constants

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated. changed to private constants

@jan-milenkov jan-milenkov added the PR: Unresolved Comments A pull request where there are comments and they need to be resolved. label Jun 18, 2026
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas removed the PR: Unresolved Comments A pull request where there are comments and they need to be resolved. label Jun 18, 2026
JeffreyDallas and others added 5 commits June 18, 2026 10:00
…lBackOff

On Windows/WSL2 runners, the solo-shared-resources postgresql pod fails with
ImagePullBackOff because bitnami/postgresql:latest is not pre-cached into
the Kind cluster. This causes a cascade: PostgreSQL never starts → mirror
REST health check fails → mirror pinger can never become ready. Adding the
image to solo-cache-images-target.yaml ensures it is pre-loaded before
the deploy, avoiding Docker Hub unauthenticated rate-limit 429s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
…record stall

Without a block node, CN defaults to FILE_AND_GRPC writerMode which fills the
gRPC buffer (maxBlocks=5) and stalls record file production after ~20s. Mirror
importer falls behind, pinger can never confirm transactions, and the one-shot
deploy times out. Fix profile-manager to explicitly set FILE_ONLY when no block
nodes are in the deployment state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… node

FILE_ONLY is not a valid BlockStreamWriterMode enum value in CN v0.74; the
correct value is FILE. Using FILE_ONLY caused CN to fail to become ACTIVE
across all E2E tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Reverts 242d1cc and f9e70fd. The writerMode changes are not needed
for the Windows pinger fix and are being removed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P1-💎 Current Milestone & Goals PR: Needs Team Approval A pull request that needs review from a team member.

Projects

None yet

2 participants