Add operator readiness check to e2e upgrade test #4867

Draft
swiatekm wants to merge 9 commits into open-telemetry:main from swiatekm:claude/fix-flaky-e2e-upgrade-qLq56

@swiatekm swiatekm commented Mar 16, 2026

Added a new step to the e2e upgrade test that verifies the operator is ready after deployment. The step waits until the API server can actually serve v1beta1 OpenTelemetryCollector CRs after the upgrade. The upgrade changes the storage version, which triggers internal cache flushes and conversions that may take a few seconds.

Hopefully, this will fix the flakiness of this test.

claude added 2 commits March 16, 2026 18:19
…iness

The upgrade from v0.86.0 (v1alpha1 only) to the current operator adds
v1beta1 as the storage version and configures a conversion webhook on
the OpenTelemetryCollector CRD. cert-manager's cainjector must inject
the CA bundle into the CRD's spec.conversion.webhook.clientConfig
before the API server can reach the conversion webhook.

`make deploy` waits for the operator pod rollout, but not for the CA
injection. When step-02 immediately applies a v1beta1 collector CR,
the API server tries the conversion webhook without a valid CA bundle,
causing "context deadline exceeded" after the 15s apply timeout.

Add `hack/check-operator-ready.go` after `make deploy` in step-01.
Since v1beta1 is now the storage version, even the v1alpha1 collector
create in the check requires the conversion webhook, so the check
won't pass until cert-manager has injected the CA bundle into the CRD.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
…undle

The upgrade from v0.86.0 (v1alpha1 only) to the current operator adds
v1beta1 as the storage version and configures a conversion webhook on
the OpenTelemetryCollector CRD. cert-manager's cainjector must inject
the CA bundle into the CRD's spec.conversion.webhook.clientConfig
before the API server can reach the conversion webhook.

`make deploy` waits for the operator pod rollout, but not for the CA
injection. When step-02 immediately applies a v1beta1 collector CR,
the API server tries the conversion webhook without a valid CA bundle,
causing "context deadline exceeded" after the 15s apply timeout.

Poll for the caBundle field on the CRD after deploy, which directly
gates on the cert-manager reconciliation that's needed.
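
A minimal Go sketch of the check this poll relies on, assuming the CRD has been fetched as JSON (for example with `kubectl get crd <name> -o json`). The `crd` type and the `hasCABundle` helper are hypothetical names for illustration and are not taken from this PR; only the `spec.conversion.webhook.clientConfig.caBundle` field path comes from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crd mirrors only the CustomResourceDefinition fields this check needs.
// (Hypothetical helper type, not part of the operator code base.)
type crd struct {
	Spec struct {
		Conversion struct {
			Strategy string `json:"strategy"`
			Webhook  struct {
				ClientConfig struct {
					CABundle string `json:"caBundle"`
				} `json:"clientConfig"`
			} `json:"webhook"`
		} `json:"conversion"`
	} `json:"spec"`
}

// hasCABundle reports whether the CRD JSON carries a non-empty
// spec.conversion.webhook.clientConfig.caBundle value.
func hasCABundle(raw []byte) (bool, error) {
	var c crd
	if err := json.Unmarshal(raw, &c); err != nil {
		return false, fmt.Errorf("decode CRD: %w", err)
	}
	return c.Spec.Conversion.Webhook.ClientConfig.CABundle != "", nil
}

func main() {
	// Illustrative documents: before and after the CA bundle is injected.
	before := []byte(`{"spec":{"conversion":{"strategy":"Webhook","webhook":{"clientConfig":{}}}}}`)
	after := []byte(`{"spec":{"conversion":{"strategy":"Webhook","webhook":{"clientConfig":{"caBundle":"LS0tLS1CRUdJTg=="}}}}}`)
	for _, doc := range [][]byte{before, after} {
		ok, err := hasCABundle(doc)
		fmt.Println(ok, err)
	}
}
```

A poll loop would call such a check repeatedly until it returns true or a deadline passes.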

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
@swiatekm added the "Skip Changelog" label (PRs that do not require a CHANGELOG.md entry) on Mar 16, 2026

github-actions bot commented Mar 16, 2026

E2E Test Results

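34 files ±0, 249 suites ±0, duration 2h 10m 26s ⏱️ (-51s)

| | Total | ✅ Passed | 💤 Skipped | ❌ Failed |
|---|---|---|---|---|
| Tests | 96 (±0) | 95 (-1) | 0 (±0) | 1 (+1) |
| Runs | 253 (±0) | 252 (-1) | 0 (±0) | 1 (+1) |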

For more details on these failures, see this check.

Results for commit d801d78. ± Comparison against base commit 9f793a1.

♻️ This comment has been updated with latest results.

claude and others added 5 commits March 16, 2026 19:08
Two issues cause the e2e-upgrade test to fail intermittently after
`make deploy` upgrades the operator from v0.86.0:

1. Stale leader election lease: v0.86.0 does not set
   LeaderElectionReleaseOnCancel, so when its pod is terminated during
   the rolling update, the Lease resource is held for up to 137s. The
   new operator cannot reconcile until it acquires leadership, so the
   collector never scales to 2 replicas within the assert timeout.
   Fix: delete the old lease after deploy completes.

2. Missing CRD conversion webhook CA bundle: the upgrade adds v1beta1
   as the storage version with a conversion webhook on the CRD.
   cert-manager's cainjector must inject the CA bundle before the API
   server can reach the webhook. `kubectl rollout status` does not
   wait for this.
   Fix: poll until the caBundle field is populated on the CRD.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
On step-02 failure, capture:
- Operator pod logs
- CRD conversion webhook config (caBundle presence)
- Leader election lease state
- Operator deployment/pod status
- Webhook configurations
- API server logs (last 50 lines)
- cert-manager and cainjector logs
- OpenTelemetryCollector CR status
- Collector deployment status

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
Root cause: During operator upgrade, `make deploy` updates the CRD which
triggers the API server's internal cacher to reinitialize. Reinitialization
requires listing v1beta1 objects via the conversion webhook. But at that
moment the old operator pod is gone and the new one isn't ready, so the
webhook returns 404. The cacher gets stuck in an error retry loop, and
subsequent v1beta1 requests (like the step-02 apply) hit "http: Handler
timeout".

The existing caBundle wait was necessary but not sufficient: it only
ensures the CA cert is injected, not that the webhook endpoint is
actually serving.

Fix:
1. Wait for the new operator deployment rollout to complete
2. Poll the v1beta1 API with a test GET until the API server cacher
   has successfully recovered and can serve requests

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
- Remove caBundle poll (subsumed by v1beta1 API readiness check)
- Remove redundant rollout status (already done by make deploy)
- Add leader election wait after lease deletion for realistic sequencing
- Trim catch block to essential diagnostics

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW
@swiatekm swiatekm marked this pull request as ready for review March 19, 2026 12:05
@swiatekm swiatekm requested a review from a team as a code owner March 19, 2026 12:05
@swiatekm swiatekm marked this pull request as draft March 19, 2026 16:41
claude and others added 2 commits March 20, 2026 11:21
The operator pod that starts during `make deploy` establishes informer
watches while the API server cacher is still recovering from conversion
webhook failures. These watches silently stop delivering events, so the
operator never sees the step-02 CR update and never scales the deployment.

Fix: after confirming the v1beta1 API is healthy, restart the operator
deployment so it opens fresh watches against a stable cacher.

https://claude.ai/code/session_01BD4UdBPB682CK2fyAsLmuW