Skip to content

Fix/graphql client token refresh#8791

Open
Ke-vin-S wants to merge 14 commits intoapache:mainfrom
Ke-vin-S:fix/graphql-client-token-refresh
Open

Fix/graphql client token refresh#8791
Ke-vin-S wants to merge 14 commits intoapache:mainfrom
Ke-vin-S:fix/graphql-client-token-refresh

Conversation

@Ke-vin-S
Copy link
Copy Markdown
Contributor

@Ke-vin-S Ke-vin-S commented Mar 22, 2026

Summary

This PR fixes two issues in the GitHub GraphQL client related to stability and authentication during long-running pipelines:

  1. Prevent process crash on rate-limit polling errors

    • Replaces panic in GraphqlAsyncClient background goroutine with graceful error handling.
    • On failure, errors are logged and the system retries in the next polling cycle while retaining the last known rate limit.
  2. Enable automatic token refresh for GraphQL requests

    • Replaces oauth2.StaticTokenSource-based HTTP client with the underlying http.Client from ApiAsyncClient.
    • This allows GraphQL requests to reuse the RefreshRoundTripper, enabling automatic token refresh for GitHub App installation tokens.
  3. Refactor authentication into shared HTTP client

    • Extracts auth and transport setup into CreateAuthenticatedHttpClient.
    • Enables reuse across REST and GraphQL clients.
  4. Introduce GraphQL client factory

    • Adds CreateGraphqlClient using shared authenticated HTTP client.
    • Replaces manual client setup while preserving existing behavior.
  5. Support static and refreshable tokens

    • Adds StaticRoundTripper for PAT authentication.
    • Ensures correct handling for both token types.

Does this close any open issues?

Closes #8788


Screenshots

N/A


Other Information

Design decisions

  • Avoid panic in background goroutines
    Rate-limit polling is non-critical and should not crash the entire pipeline. Errors are now handled gracefully.

  • Separate handling of initial vs runtime failures

    • Initial rate-limit fetch failure → fallback to default (5000) to prevent deadlock.
    • Runtime failure → retain last known rateRemaining to preserve correctness.
  • Use eventual consistency for rate limiting
    Rate-limit polling is best-effort; failures should not block or terminate execution.

  • Centralize transport layer for consistency
    Both GraphQL and REST clients now use a shared authenticated HTTP client, ensuring consistent authentication, proxy, and retry behavior.

  • Avoid StaticTokenSource
    Static tokens break long-running pipelines due to expiration. Using RefreshRoundTripper ensures seamless token rotation.

  • Handle static vs refreshable tokens explicitly
    Shared transport distinguishes between PAT (static) and installation (refreshable) tokens to avoid incorrect refresh/retry behavior.


@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. component/framework This issue or PR relates to the framework pr-type/bug-fix This PR fixes a bug priority/high This issue is very important labels Mar 22, 2026
@Ke-vin-S Ke-vin-S marked this pull request as draft March 22, 2026 05:40
@Ke-vin-S Ke-vin-S marked this pull request as ready for review March 22, 2026 18:47
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. component/plugins This issue or PR relates to plugins improvement pr-type/refactor This PR refactors existing features and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 22, 2026
@Ke-vin-S
Copy link
Copy Markdown
Contributor Author

@klesh I’ve pushed an update for this—could you please take a look? Happy to make changes if needed.

Copy link
Copy Markdown
Contributor

@danielemoraschi danielemoraschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments for your review. Thanks!

) (*helper.GraphqlAsyncClient, errors.Error) {

// inject the shared auth layer
httpClient, err := githubTasks.CreateAuthenticatedHttpClient(taskCtx, connection, nil)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this with nil as the base client I believe will default to an empty http.Client. This ignores the proxy configuration (connection.GetProxy()) and abandons the HTTP/SOCKS5 Proxy logic associated with the connection. It might break enterprise proxy setups as the proxy configuration is never attached, while the previous implementation explicitly injected a custom proxy http.Transport in backend/plugins/github_graphql/impl/impl.go into the GraphQL HTTP client. This refactoring drops that setup for corporate proxy users.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that actually happens in this code, I fixed it by passing the already configured 'http.Client' to this, it already has proxy, auth configurations.

connection.AuthMethod,
)

} else if connection.Token != "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connection.Token can contain multiple comma-separated tokens (e.g. tok1,tok2,tok3). The old code in impl.go split on comma and used only the first token (tokens[0]) via oauth2.StaticTokenSource. This passes connection.Token directly to StaticRoundTripper, which injects the entire unsplit string as Authorization: Bearer tok1,tok2,tok3. On the REST path, this also overwrites the correctly-rotated single token set by SetupAuthentication at the request level.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented that in static round tripper

logger.Info("Persisted initial token for connection %d", connection.ID)
}
}
println("http client HIT3")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover development debug print statement deployed in production code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

}

// resolveRateLimit determines the rate limit for GraphQL requests using task configuration -> else default constant.
func resolveRateLimit(taskCtx plugin.TaskContext, logger log.Logger) int {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveRateLimit reads env config at line 69, then opts (including WithFallbackRateLimit(5000)) are applied at lines 82-84, unconditionally overwriting the env value. Wouldn't a user setting GRAPHQL_RATE_LIMIT=2000 be silently overridden by the code-level "fallback" of 5000 from graphql_client.go:69?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

logger.Info(`github graphql rate limit are disabled, fallback to 5000req/hour`)
return 5000, nil, nil
logger.Info(`github graphql rate limit unavailable, using fallback rate limit`)
return 0, nil, errors.Default.New("rate limit unavailable")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code returned 5000, nil, nil for GitHub Enterprise with rate limiting disabled. This now returns an error instead of the previous 5000. The error propagates so the functionality is preserved, but every task run against such servers now produces a warning for what was previously a silently handled normal case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the Warn logs as Info logs. but kept the error here so the handling is centralized

@klesh
Copy link
Copy Markdown
Contributor

klesh commented Apr 1, 2026

@Ke-vin-S Great job! Would you like to take a look at @danielemoraschi’s comments?

@Ke-vin-S
Copy link
Copy Markdown
Contributor Author

Ke-vin-S commented Apr 1, 2026

@Ke-vin-S Great job! Would you like to take a look at @danielemoraschi’s comments?

Yes, currently working on it

@klesh
Copy link
Copy Markdown
Contributor

klesh commented Apr 1, 2026

Thanks for your contribution! Could you please take a look at the failed checks?

@zaiddialpad
Copy link
Copy Markdown

Hi — just wanted to share that we're actively hitting both issues this PR addresses in production, and this fix would be hugely impactful for us.

Our setup

  • DevLake v1.0.3-beta9 on GKE (Helm chart)
  • GitHub App auth across 4 connections (408 repos total)
  • Largest repo: ~20K PRs, ~1,200 currently OPEN

What we're experiencing

1. The updateRateRemaining panic (the primary fix in this PR)

Our lake pod is at 42 restarts — almost entirely from this panic. The sequence is:

  1. The github_graphql "Collect Pull Requests" subtask enters Phase 2 (refetching all OPEN PRs)
  2. Phase 2 sends heavy compound GraphQL queries (100 PRs per batch) that trigger 502 Bad Gateway from GitHub
  3. The retry loop runs for 10+ retries × 120s each, easily exceeding 1 hour
  4. Our GitHub App token expires (~1hr lifetime)
  5. The background updateRateRemaining goroutine calls /rate_limit with the expired token → 401 Unauthorized
  6. panic() at graphql_async_client.go:129 → pod crash
panic: non-200 OK status code: 401 Unauthorized body: "Bad credentials"

goroutine 3660 [running]:
github.com/apache/incubator-devlake/helpers/pluginhelper/api.(*GraphqlAsyncClient).updateRateRemaining.func1()
    /app/helpers/pluginhelper/api/graphql_async_client.go:129 +0x166

2. GraphQL client token refresh (the second fix in this PR)

Since the GraphQL client uses a StaticTokenSource that freezes the token at task start, any task running longer than ~1 hour with GitHub App auth is guaranteed to eventually fail. For us, the most important repo's github_graphql task takes 2-3 hours even when it succeeds, so we hit this on every single pipeline run.

The REST client was already fixed for this via #8746, but without this PR the GraphQL client remains vulnerable.

Impact on us

  • Largest Repo has failed on 4 of the last 6 daily pipeline runs due to this combination
  • Each failure wastes 15+ hours of pipeline time (retries + pod restarts + re-queuing)
  • The pod crashes affect all running tasks, not just the one that triggered the panic

Summary

This PR addresses exactly the two bugs compounding against us:

  1. The panic → replaced with graceful error handling ✅
  2. The static token → wired up to RefreshRoundTripper for automatic renewal ✅

Would love to see this merged. Happy to help test if that would be useful — we have a production environment that reliably reproduces both issues on every nightly sync.

@Ke-vin-S Ke-vin-S force-pushed the fix/graphql-client-token-refresh branch from a6895af to 6b9ad18 Compare April 8, 2026 02:53
@Ke-vin-S
Copy link
Copy Markdown
Contributor Author

Ke-vin-S commented Apr 8, 2026

Fixed commit message formatting - changed underscores to hyphens in scopes to pass the lint check. Force-pushed to update all commits.

Ke-vin-S added 10 commits April 8, 2026 08:57
…utine

Replace panic in GraphqlAsyncClient rate-limit polling goroutine with graceful error handling.

Previously, any error while fetching rate limit (e.g., transient network issues or 401 responses) would trigger a panic inside a background goroutine, crashing the entire DevLake process.

Now, errors are logged and the client retries in the next cycle while retaining the last known rate limit.

Design decisions:
- Avoid panic in background goroutines: rate-limit polling is non-critical and should not bring down the entire pipeline.
- Use last known rateRemaining on runtime failures instead of resetting or blocking, ensuring continued progress with eventual consistency.
- Retry via existing polling mechanism instead of immediate retry to prevent tight retry loops and unnecessary API pressure.
- Introduce a default fallback (5000) only for initial rate-limit fetch failures, since no prior state exists at startup.
- Separate handling of initial vs runtime failures:
  - Initial failure → fallback to default (5000)
  - Runtime failure → retain previous value

Fixes apache#8788 (bug 1)
…token refresh

Replace oauth2.StaticTokenSource-based HTTP client with the underlying http.Client from ApiAsyncClient.

Previously, the GraphQL client constructed its own HTTP client using StaticTokenSource, which froze the access token at task start time. This caused GitHub App installation tokens (which expire after ~1 hour) to become invalid during long-running pipelines, leading to persistent 401 errors.

Now, the GraphQL client reuses apiClient.GetClient(), which is already configured with RefreshRoundTripper and TokenProvider. This enables automatic token refresh on 401 responses, aligning GraphQL behavior with the REST client.

Design decisions:
- Reuse transport layer instead of duplicating authentication logic to ensure consistency across REST and GraphQL clients.
- Avoid StaticTokenSource, as it prevents token refresh and breaks long-running pipelines.
- Leverage existing RefreshRoundTripper for transparent token rotation without modifying GraphQL query logic.
- Keep protocol-specific logic (GraphQL vs REST) separate while sharing the underlying HTTP transport.

This ensures GraphQL pipelines using GitHub App authentication can run beyond token expiry without failure.

Fixes apache#8788 (bug 2)
…lient

- moved token provider and refresh round tripper setup into a reusable helper
- introduced CreateAuthenticatedHttpClient to centralize auth + transport logic
- updated CreateApiClient to use shared http client instead of inline setup

Rationale:
- decouples authentication (transport layer) from REST-specific client logic
- enables reuse for GraphQL client without duplicating token refresh logic
- aligns architecture with separation of concerns (http transport vs api clients)
…ntegrate into task flow

- added CreateGraphqlClient to encapsulate graphql client construction
- reused CreateAuthenticatedHttpClient from github/tasks to inject token refresh via RoundTripper
- replaced manual graphql client setup in PrepareTaskData with new factory function
- preserved existing rate limit handling via getRateRemaining callback
- preserved query cost calculation using SetGetRateCost

Technical details:
- graphql client now uses http transport with TokenProvider and RefreshRoundTripper
- removes dependency on oauth2 client and avoids token expiration issues
- decouples graphql client from REST ApiClient by avoiding reuse of apiClient.GetClient()
- maintains compatibility with github.com and enterprise graphql endpoints

Note:
- shared auth logic remains in github/tasks and is imported with alias to avoid package name collision
- introduces cross-plugin dependency (github_graphql → github/tasks) as a pragmatic tradeoff to avoid duplication
…ents

add StaticRoundTripper for PAT authentication and use it in the shared http client.

since the same client is used by both REST and GraphQL, auth handling must distinguish
between refreshable tokens and static tokens. avoid applying refresh/retry logic to PAT.

ensures correct behavior across clients and prevents unnecessary retries for static auth.
…e limit

Implement a layered fallback mechanism for GraphQL rate limiting:

1. Dynamic rate limit from provider (getRateRemaining)
2. Per-client override (WithFallbackRateLimit)
3. Config override (GRAPHQL_RATE_LIMIT)
4. Default fallback (1000)

Also moved GitHub-specific fallback (5000) via WithFallbackRateLimit
to the Graphql client.
Reused `http.Client` inside the apiClient returned by `CreateApiClient` method, so keeping the proxy and auth configurations the same.That also keep the centralized management of logic.
Priority order fixed for fallback rate limit, priority order is:
1.Env variable
2.Value set with `WithFallbackRateLimit`
3.default value in the code
This all works only when the `getRateRemaining` fails: hence the fallback
Ke-vin-S added 3 commits April 8, 2026 08:57
… for AccessToken connections

Previously, connection.Token (comma-separated PATs) was injected as-is
into the Authorization header, sending "Bearer tok1,tok2,tok3" instead
of a single rotated token.

StaticRoundTripper now splits the raw token string on comma and rotates
through tokens round-robin using an atomic counter.

For REST: StaticRoundTripper operates at transport level and always
overwrites the Authorization header set by SetupAuthentication.
SetupAuthentication is retained because conn.tokens is still required
by GetTokensCount() for rate limit calculation — but its header write
is superseded by StaticRoundTripper on every request.

For GraphQL: SetupAuthentication is never called by the graphql client,
so StaticRoundTripper is the only auth mechanism on this path — without
this fix, GraphQL requests were sent with the full unsplit token string.
…ck cycle

Replace exported GraphqlClientOption type with inline func(*GraphqlAsyncClient)
in CreateAsyncGraphqlClient signature. The named type caused mockery to generate
a mock file (GraphqlClientOption.go) that created an import cycle in tests.
@Ke-vin-S Ke-vin-S force-pushed the fix/graphql-client-token-refresh branch from 6b9ad18 to c7c5e72 Compare April 8, 2026 04:35
@Ke-vin-S
Copy link
Copy Markdown
Contributor Author

Ke-vin-S commented Apr 8, 2026

Hey @danielemoraschi , addressed your feedback. Would appreciate a second look.

@Ke-vin-S
Copy link
Copy Markdown
Contributor Author

Ke-vin-S commented Apr 8, 2026

Hi — just wanted to share that we're actively hitting both issues this PR addresses in production, and this fix would be hugely impactful for us.

Our setup

* **DevLake v1.0.3-beta9** on GKE (Helm chart)

* **GitHub App auth** across 4 connections (408 repos total)

* Largest repo: ~20K PRs, ~1,200 currently OPEN

What we're experiencing

1. The updateRateRemaining panic (the primary fix in this PR)

Our lake pod is at 42 restarts — almost entirely from this panic. The sequence is:

1. The `github_graphql` "Collect Pull Requests" subtask enters Phase 2 (refetching all OPEN PRs)

2. Phase 2 sends heavy compound GraphQL queries (100 PRs per batch) that trigger **502 Bad Gateway** from GitHub

3. The retry loop runs for 10+ retries × 120s each, easily exceeding 1 hour

4. Our GitHub App token expires (~1hr lifetime)

5. The background `updateRateRemaining` goroutine calls `/rate_limit` with the expired token → **401 Unauthorized**

6. `panic()` at `graphql_async_client.go:129` → pod crash
panic: non-200 OK status code: 401 Unauthorized body: "Bad credentials"

goroutine 3660 [running]:
github.com/apache/incubator-devlake/helpers/pluginhelper/api.(*GraphqlAsyncClient).updateRateRemaining.func1()
    /app/helpers/pluginhelper/api/graphql_async_client.go:129 +0x166

2. GraphQL client token refresh (the second fix in this PR)

Since the GraphQL client uses a StaticTokenSource that freezes the token at task start, any task running longer than ~1 hour with GitHub App auth is guaranteed to eventually fail. For us, the most important repo's github_graphql task takes 2-3 hours even when it succeeds, so we hit this on every single pipeline run.

The REST client was already fixed for this via #8746, but without this PR the GraphQL client remains vulnerable.

Impact on us

* **Largest Repo has failed on 4 of the last 6 daily pipeline runs** due to this combination

* Each failure wastes 15+ hours of pipeline time (retries + pod restarts + re-queuing)

* The pod crashes affect **all** running tasks, not just the one that triggered the panic

Summary

This PR addresses exactly the two bugs compounding against us:

1. The panic → replaced with graceful error handling ✅

2. The static token → wired up to `RefreshRoundTripper` for automatic renewal ✅

Would love to see this merged. Happy to help test if that would be useful — we have a production environment that reliably reproduces both issues on every nightly sync.

If you get a chance to test against your setup before it's merged, that'd be great — but no pressure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/framework This issue or PR relates to the framework component/plugins This issue or PR relates to plugins improvement pr-type/bug-fix This PR fixes a bug pr-type/refactor This PR refactors existing features priority/high This issue is very important size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug][github_graphql] GitHub App installation token not refreshed for GraphQL client, panic and pipeline crash on large repos

4 participants