Add operator readiness check to e2e upgrade test by swiatekm · Pull Request #4867 · open-telemetry/opentelemetry-operator

swiatekm · 2026-03-16T18:26:12Z

Added a new step to the e2e upgrade test that verifies the operator is ready after deployment. This step waits until the API server is actually ready to serve v1beta1 OpenTelemetryCollector CRs after the upgrade. The upgrade changes the storage version, which triggers internal cache flushes and conversions that may take a few seconds.

Hopefully, this will fix the flakiness of this test.

…iness

The upgrade from v0.86.0 (v1alpha1 only) to the current operator adds v1beta1 as the storage version and configures a conversion webhook on the OpenTelemetryCollector CRD. cert-manager's cainjector must inject the CA bundle into the CRD's spec.conversion.webhook.clientConfig before the API server can reach the conversion webhook. `make deploy` waits for the operator pod rollout, but not for the CA injection. When step-02 immediately applies a v1beta1 collector CR, the API server tries the conversion webhook without a valid CA bundle, causing "context deadline exceeded" after the 15s apply timeout.

Add `hack/check-operator-ready.go` after `make deploy` in step-01. Since v1beta1 is now the storage version, even the v1alpha1 collector create in the check requires the conversion webhook, so the check won't pass until cert-manager has injected the CA bundle into the CRD.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW

…undle

The upgrade from v0.86.0 (v1alpha1 only) to the current operator adds v1beta1 as the storage version and configures a conversion webhook on the OpenTelemetryCollector CRD. cert-manager's cainjector must inject the CA bundle into the CRD's spec.conversion.webhook.clientConfig before the API server can reach the conversion webhook. `make deploy` waits for the operator pod rollout, but not for the CA injection. When step-02 immediately applies a v1beta1 collector CR, the API server tries the conversion webhook without a valid CA bundle, causing "context deadline exceeded" after the 15s apply timeout.

Poll for the caBundle field on the CRD after deploy, which directly gates on the cert-manager reconciliation that's needed.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
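The caBundle check described in this commit could be sketched roughly as below. This is only an illustration, not the script used in the PR: the function name `has_ca_bundle` is hypothetical, and it assumes the CRD has already been fetched as JSON (for example with `kubectl get crd <name> -o json`) and parsed into a dict.

```python
from collections.abc import Mapping
from typing import Any


def has_ca_bundle(crd: Mapping[str, Any]) -> bool:
    """Return True if the CRD's conversion webhook has a non-empty caBundle.

    Walks spec.conversion.webhook.clientConfig.caBundle, the field that
    cert-manager's cainjector is expected to populate. Any missing
    intermediate key is treated as "not injected yet".
    """
    node: Any = crd
    for key in ("spec", "conversion", "webhook", "clientConfig"):
        if not isinstance(node, Mapping):
            return False
        node = node.get(key)
    if not isinstance(node, Mapping):
        return False
    # An empty string or absent field both count as "not injected".
    return bool(node.get("caBundle"))
```

A polling loop would call this repeatedly on a freshly fetched CRD until it returns True or a timeout expires.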

github-actions · 2026-03-16T18:49:24Z

E2E Test Results

34 files ±0    249 suites ±0    2h 10m 26s ⏱️ -51s
96 tests ±0    95 ✅ -1    0 💤 ±0    1 ❌ +1
253 runs ±0    252 ✅ -1    0 💤 ±0    1 ❌ +1

For more details on these failures, see this check.

Results for commit d801d78. ± Comparison against base commit 9f793a1.

♻️ This comment has been updated with latest results.

Two issues cause the e2e-upgrade test to fail intermittently after `make deploy` upgrades the operator from v0.86.0:

1. Stale leader election lease: v0.86.0 does not set LeaderElectionReleaseOnCancel, so when its pod is terminated during the rolling update, the Lease resource is held for up to 137s. The new operator cannot reconcile until it acquires leadership, so the collector never scales to 2 replicas within the assert timeout. Fix: delete the old lease after deploy completes.

2. Missing CRD conversion webhook CA bundle: the upgrade adds v1beta1 as the storage version with a conversion webhook on the CRD. cert-manager's cainjector must inject the CA bundle before the API server can reach the webhook. `kubectl rollout status` does not wait for this. Fix: poll until the caBundle field is populated on the CRD.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
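To illustrate why a lease that was never released blocks the new operator, here is a minimal sketch of the expiry arithmetic. It is an assumption-laden illustration, not code from the PR: `lease_expired` is a hypothetical helper, the timestamp format assumes the microsecond-precision `spec.renewTime` of a coordination.k8s.io/v1 Lease, and the 137s figure is taken from the commit message above.

```python
from datetime import datetime, timedelta, timezone


def lease_expired(renew_time: str, lease_duration_seconds: int, now: datetime) -> bool:
    """Return True once the last renewal is older than the lease duration.

    renew_time is expected in the form "2026-03-16T18:26:12.000000Z"
    (assumed format of a Lease's spec.renewTime).
    """
    renewed = datetime.strptime(renew_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    renewed = renewed.replace(tzinfo=timezone.utc)
    return now - renewed > timedelta(seconds=lease_duration_seconds)
```

Until this condition holds, a new candidate has to wait, which is why the commit deletes the old lease instead of waiting it out.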

On step-02 failure, capture:

- Operator pod logs
- CRD conversion webhook config (caBundle presence)
- Leader election lease state
- Operator deployment/pod status
- Webhook configurations
- API server logs (last 50 lines)
- cert-manager and cainjector logs
- OpenTelemetryCollector CR status
- Collector deployment status

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW

Root cause: During operator upgrade, `make deploy` updates the CRD which triggers the API server's internal cacher to reinitialize. Reinitialization requires listing v1beta1 objects via the conversion webhook. But at that moment the old operator pod is gone and the new one isn't ready, so the webhook returns 404. The cacher gets stuck in an error retry loop, and subsequent v1beta1 requests (like the step-02 apply) hit "http: Handler timeout".

The existing caBundle wait was necessary but not sufficient: it only ensures the CA cert is injected, not that the webhook endpoint is actually serving.

Fix:

1. Wait for the new operator deployment rollout to complete
2. Poll the v1beta1 API with a test GET until the API server cacher has successfully recovered and can serve requests

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
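The "poll with a test GET until the API recovers" step could look something like the following sketch. It is not the PR's implementation: `poll_until` and its parameters are hypothetical, and the probe is left as a caller-supplied function (in practice it might wrap a `kubectl get` against the v1beta1 resource). Errors raised by the probe, such as a handler timeout, are treated as "not ready yet".

```python
import time
from typing import Callable


def poll_until(
    probe: Callable[[], bool],
    timeout: float,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or timeout seconds have elapsed.

    Exceptions from probe() count as a failed attempt rather than aborting,
    since transient errors are expected while the API server recovers.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat any probe error as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; a caller would normally leave both at their defaults.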

- Remove caBundle poll (subsumed by v1beta1 API readiness check)
- Remove redundant rollout status (already done by make deploy)
- Add leader election wait after lease deletion for realistic sequencing
- Trim catch block to essential diagnostics

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW

The operator pod that starts during `make deploy` establishes informer watches while the API server cacher is still recovering from conversion webhook failures. These watches silently stop delivering events, so the operator never sees the step-02 CR update and never scales the deployment. Fix: after confirming the v1beta1 API is healthy, restart the operator deployment so it opens fresh watches against a stable cacher. https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
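A restart of this kind is typically done with `kubectl rollout restart` followed by `kubectl rollout status`. The sketch below only builds those two command lines; `restart_commands` is a hypothetical helper, and the namespace and deployment name are placeholders the caller must supply, since the commit message does not state them.

```python
from typing import List


def restart_commands(namespace: str, deployment: str, timeout: str = "120s") -> List[List[str]]:
    """Build kubectl argv lists to restart a deployment and wait for rollout.

    The first command triggers a fresh rollout (new pods, new watches);
    the second blocks until that rollout finishes or the timeout expires.
    """
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "rollout", "restart", target, "-n", namespace],
        ["kubectl", "rollout", "status", target, "-n", namespace, f"--timeout={timeout}"],
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)` in order, after the v1beta1 API readiness check has passed.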

claude added 2 commits March 16, 2026 18:19

swiatekm added the Skip Changelog label (PRs that do not require a CHANGELOG.md entry) Mar 16, 2026

claude and others added 5 commits March 16, 2026 19:08

Merge branch 'main' into claude/fix-flaky-e2e-upgrade-qLq56

b285da5

swiatekm marked this pull request as ready for review March 19, 2026 12:05

swiatekm requested a review from a team as a code owner March 19, 2026 12:05

swiatekm marked this pull request as draft March 19, 2026 16:41

claude and others added 2 commits March 20, 2026 11:21

Merge branch 'main' into claude/fix-flaky-e2e-upgrade-qLq56

d801d78

swiatekm mentioned this pull request Mar 23, 2026

Make the e2e-upgrade test less flaky #4887

Open

swiatekm wants to merge 9 commits into open-telemetry:main from swiatekm:claude/fix-flaky-e2e-upgrade-qLq56
