Add operator readiness check to e2e upgrade test #4867
Draft
swiatekm wants to merge 9 commits into open-telemetry:main
…iness

The upgrade from v0.86.0 (v1alpha1 only) to the current operator adds v1beta1 as the storage version and configures a conversion webhook on the OpenTelemetryCollector CRD. cert-manager's cainjector must inject the CA bundle into the CRD's spec.conversion.webhook.clientConfig before the API server can reach the conversion webhook.

`make deploy` waits for the operator pod rollout, but not for the CA injection. When step-02 immediately applies a v1beta1 collector CR, the API server tries the conversion webhook without a valid CA bundle, causing "context deadline exceeded" after the 15s apply timeout.

Add `hack/check-operator-ready.go` after `make deploy` in step-01. Since v1beta1 is now the storage version, even the v1alpha1 collector create in the check requires the conversion webhook, so the check won't pass until cert-manager has injected the CA bundle into the CRD.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
…undle

The upgrade from v0.86.0 (v1alpha1 only) to the current operator adds v1beta1 as the storage version and configures a conversion webhook on the OpenTelemetryCollector CRD. cert-manager's cainjector must inject the CA bundle into the CRD's spec.conversion.webhook.clientConfig before the API server can reach the conversion webhook.

`make deploy` waits for the operator pod rollout, but not for the CA injection. When step-02 immediately applies a v1beta1 collector CR, the API server tries the conversion webhook without a valid CA bundle, causing "context deadline exceeded" after the 15s apply timeout.

Poll for the caBundle field on the CRD after deploy, which directly gates on the cert-manager reconciliation that's needed.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
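To illustrate the caBundle gate described in this commit, here is a minimal Go sketch. It assumes the CRD has already been fetched as JSON (for example with `kubectl get crd <name> -o json`) and only checks that `spec.conversion.webhook.clientConfig.caBundle` is non-empty. The `crd` type and the `hasCABundle` helper are hypothetical names for this example, not code from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crd mirrors only the CustomResourceDefinition fields needed for the check.
// Hypothetical type for illustration; not taken from the PR.
type crd struct {
	Spec struct {
		Conversion struct {
			Webhook struct {
				ClientConfig struct {
					CABundle string `json:"caBundle"`
				} `json:"clientConfig"`
			} `json:"webhook"`
		} `json:"conversion"`
	} `json:"spec"`
}

// hasCABundle reports whether the CRD JSON carries a non-empty caBundle
// in its conversion webhook client config.
func hasCABundle(crdJSON []byte) (bool, error) {
	var c crd
	if err := json.Unmarshal(crdJSON, &c); err != nil {
		return false, err
	}
	return c.Spec.Conversion.Webhook.ClientConfig.CABundle != "", nil
}

func main() {
	// Made-up CRD fragments: one after CA injection, one before.
	injected := []byte(`{"spec":{"conversion":{"strategy":"Webhook","webhook":{"clientConfig":{"caBundle":"LS0tLS1CRUdJTg=="}}}}}`)
	pending := []byte(`{"spec":{"conversion":{"strategy":"Webhook","webhook":{"clientConfig":{}}}}}`)

	for _, doc := range [][]byte{injected, pending} {
		ok, err := hasCABundle(doc)
		fmt.Println(ok, err)
	}
}
```

A check like this only confirms that the CA bundle is present; it says nothing about whether the webhook endpoint behind it is serving.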
E2E Test Results

34 files ±0, 249 suites ±0, 2h 10m 26s ⏱️ -51s

For more details on these failures, see this check.

Results for commit d801d78. ± Comparison against base commit 9f793a1.

♻️ This comment has been updated with latest results.
Two issues cause the e2e-upgrade test to fail intermittently after `make deploy` upgrades the operator from v0.86.0:

1. Stale leader election lease: v0.86.0 does not set LeaderElectionReleaseOnCancel, so when its pod is terminated during the rolling update, the Lease resource is held for up to 137s. The new operator cannot reconcile until it acquires leadership, so the collector never scales to 2 replicas within the assert timeout. Fix: delete the old lease after deploy completes.

2. Missing CRD conversion webhook CA bundle: the upgrade adds v1beta1 as the storage version with a conversion webhook on the CRD. cert-manager's cainjector must inject the CA bundle before the API server can reach the webhook. `kubectl rollout status` does not wait for this. Fix: poll until the caBundle field is populated on the CRD.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
On step-02 failure, capture:

- Operator pod logs
- CRD conversion webhook config (caBundle presence)
- Leader election lease state
- Operator deployment/pod status
- Webhook configurations
- API server logs (last 50 lines)
- cert-manager and cainjector logs
- OpenTelemetryCollector CR status
- Collector deployment status

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
Root cause: During operator upgrade, `make deploy` updates the CRD, which triggers the API server's internal cacher to reinitialize. Reinitialization requires listing v1beta1 objects via the conversion webhook. But at that moment the old operator pod is gone and the new one isn't ready, so the webhook returns 404. The cacher gets stuck in an error retry loop, and subsequent v1beta1 requests (like the step-02 apply) hit "http: Handler timeout".

The existing caBundle wait was necessary but not sufficient: it only ensures the CA cert is injected, not that the webhook endpoint is actually serving.

Fix:

1. Wait for the new operator deployment rollout to complete
2. Poll the v1beta1 API with a test GET until the API server cacher has successfully recovered and can serve requests

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
- Remove caBundle poll (subsumed by v1beta1 API readiness check)
- Remove redundant rollout status (already done by make deploy)
- Add leader election wait after lease deletion for realistic sequencing
- Trim catch block to essential diagnostics

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
The operator pod that starts during `make deploy` establishes informer watches while the API server cacher is still recovering from conversion webhook failures. These watches silently stop delivering events, so the operator never sees the step-02 CR update and never scales the deployment.

Fix: after confirming the v1beta1 API is healthy, restart the operator deployment so it opens fresh watches against a stable cacher.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
Added a new step to the e2e upgrade test that verifies the operator is ready after deployment. This step waits until the API server can actually serve v1beta1 OpenTelemetryCollector CRs after the upgrade. The upgrade changes the storage version, which triggers internal cache flushes and conversions that may take a few seconds.
Hopefully, this will fix the flakiness of this test.