fix: add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429#4712
Open
JeffreyDallas wants to merge 12 commits into
Open
fix: add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429#4712JeffreyDallas wants to merge 12 commits into
JeffreyDallas wants to merge 12 commits into
Conversation
Intermittent HTTP 403 errors from the GitHub API on Windows runners were caused by two compounding issues: 1. EdgeVersionFetcher never included an Authorization header, leaving all five component-version lookups unauthenticated (60 req/hour shared per IP on GitHub-hosted runners). 2. No retry logic existed for transient 403/429 rate-limit responses, so a single bad response permanently failed the dependency-check step. Introduce GitHubApiClient, a static utility class that: - Adds a Bearer token from GITHUB_TOKEN when present (raises limit to 5 000 req/hour for authenticated requests). - Retries on HTTP 403 and 429 with exponential backoff (up to 3 attempts), honouring the Retry-After and X-RateLimit-Reset response headers. Wire GitHubApiClient.get() into EdgeVersionFetcher and all four dependency managers (crane, gvproxy, podman, vfkit), replacing ~15 lines of duplicated header-building + fetch code in each. Add 9 unit tests for GitHubApiClient covering auth, retry, Retry-After header parsing, exhaustion after max retries, no-retry on non-rate-limit errors, and network-failure wrapping. Fixes #4711 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
Contributor
Contributor
On Windows runners using WSL2, the mirror-node pinger pod takes longer to become ready than the default 300×2 s = 10 minutes because it must: 1. pass its own startup probe (/tmp/alive file), 2. connect to the Mirror REST API (http://mirror-1-restjava:80), 3. submit a transaction through the consensus network, and 4. verify that transaction was ingested by the mirror importer. All of these steps are slower under WSL2 due to network-layer indirection, and the generic PODS_READY_MAX_ATTEMPTS budget (shared with every other pod check) is often exhausted just as the pinger is about to go Ready. Add MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS (default 450) and MIRROR_NODE_PINGER_PODS_READY_DELAY (default 2 000 ms), giving pinger checks a 15-minute budget — consistent with relay and block-node — while leaving every other mirror-node pod check at the existing 10-minute limit. Both constants are overridable via environment variables. Observed failure: https://github.com/hiero-ledger/solo/actions/runs/27705538841/job/81956787144?pr=4703 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
The relay's withWaitCondition for NodesStarted was set to 10 minutes, but the node start sequence emits NodesStarted only after the full chain completes (including waitForTss), which can take 15-25+ minutes on slow or busy runners. Both timeouts are now env-var overridable constants: NODES_STARTED_EVENT_TIMEOUT_MINUTES (default 30) and MIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES (default 10). Fixes #4714 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Contributor
The describe block in separate-node-add.test.ts had a 3-minute default timeout that matched the intermittent failure reported in #4715 (Mocha applies the describe-level timeout when the describe callback is async). The "should add a new node to the network successfully" test runs three sequential prepare/submit/execute commands that can take 10-15+ minutes on slow CI runners, but only had a 12-minute timeout. - describe block default: 3 min → 20 min - "should add a new node" test: 12 min → 20 min - outer describe in node-add-local: 3 min → 30 min Fixes #4715 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Windows/WSL2 runners need more time for the mirror pinger pod to become ready during concurrent one-shot deploys. Previous 15-min limit (450 attempts × 2000ms) was insufficient; bumped to 900 attempts = 30 min. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… 429 When helm dependency build runs for OCI chart repos (registry-1.docker.io/ bitnamicharts), it contacts Docker Hub for manifest verification even when the tarball already exists in charts/, triggering unauthenticated rate-limit 429 errors. Fix: only run helm dependency build on cache miss; on cache hit, restore tarballs and skip the build entirely. The cache key is the version.ts hash, so the same key guarantees identical chart versions and tarballs. Fixes #4721 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
| import {SoloErrors} from './errors/solo-errors.js'; | ||
|
|
||
| const RETRY_MAX_ATTEMPTS: number = 3; | ||
| const RETRY_BASE_DELAY_MS: number = 1000; |
Contributor
There was a problem hiding this comment.
use Duration with to toMilis(), also consider better placements for those constants
Contributor
Author
There was a problem hiding this comment.
updated. changed to private constants
…lBackOff On Windows/WSL2 runners, the solo-shared-resources postgresql pod fails with ImagePullBackOff because bitnami/postgresql:latest is not pre-cached into the Kind cluster. This causes a cascade: PostgreSQL never starts → mirror REST health check fails → mirror pinger can never become ready. Adding the image to solo-cache-images-target.yaml ensures it is pre-loaded before the deploy, avoiding Docker Hub unauthenticated rate-limit 429s. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
…record stall Without a block node, CN defaults to FILE_AND_GRPC writerMode which fills the gRPC buffer (maxBlocks=5) and stalls record file production after ~20s. Mirror importer falls behind, pinger can never confirm transactions, and the one-shot deploy times out. Fix profile-manager to explicitly set FILE_ONLY when no block nodes are in the deployment state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… node FILE_ONLY is not a valid BlockStreamWriterMode enum value in CN v0.74; the correct value is FILE. Using FILE_ONLY caused CN to fail to become ACTIVE across all E2E tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This pull request changes the following:
GitHubApiClientwithGITHUB_TOKENauth and automatic retry on 403/429 responses (fixes GitHub API rate-limit failures in CI)NodesStartedevent wait timeout from 10 → 30 minutes (env-var overridable viaNODES_STARTED_EVENT_TIMEOUT_MINUTES) to fix relay timeout in one-shot deployMirrorNodeDeployedevent wait timeout from 10 → 10 minutes (env-var overridable viaMIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES)MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS450 → 900 × 2 s) to fix Windows/WSL2 slow-runner failuresnode-add-localandseparate-node-addtest timeouts to fix intermittent CI timeout failures (e2e add local test intermittently failure due to time out #4715)one-shot-local-buildexample: skiphelm dependency buildwhen chart tarballs are already cached, avoiding Docker Hub unauthenticated 429 rate-limit errors (one shot local built test failed intermittently #4721)Related Issues
Pull request (PR) checklist
package.jsonchanges have been explained to and approved by a repository managerTesting
The following manual testing was done:
GitHubApiClient(9 tests covering auth header injection, 403/429 retry with backoff, 404 passthrough, non-auth requests)fix/github-api-rate-limit-retrybranchThe following was not tested:
one-shot-local-buildhelm cache skip (cache-miss path still contacts Docker Hub on first run)