🐛 addon fix for Install namespace potential race condition with addondeploymentconfig - For issue: https://github.com/open-cluster-management-io/ocm/issues/1465#381
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tesshuflower The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
WalkthroughGetDesiredAddOnDeploymentConfig now distinguishes a missing ConfigReference from a pending addon rollout by checking whether the addon supports AddOnDeploymentConfig and whether its ManagedClusterAddOnConditionConfigured condition is True; it returns a retryable error when support exists but Configured is not True. Two unexported helpers were added; tests expanded. ChangesDeployment config resolution and tests
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.1)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…CA status Check in AgentInstallNamespaceFromDeploymentConfigFunc: if the addon declares support for addondeploymentconfigs but the Configured condition is not yet True, return an error so callers requeue instead of proceeding with the default namespace. Addresses: open-cluster-management-io/ocm#1465 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tesshu Flower <tflower@redhat.com>
…mentConfig Move the Configured condition check from AgentInstallNamespaceFromDeploymentConfigFunc into GetDesiredAddOnDeploymentConfig so all three callers (install namespace, helm values, image overrides) are protected from the race. Add dedicated TestGetDesiredAddOnDeploymentConfig covering the new guard. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tesshu Flower <tflower@redhat.com>
b1ab45f to
6c7647d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/utils/addon_config.go (1)
142-149: 💤 Low valueHeads up: this guard depends on
Status.SupportedConfigsbeing populated, which is itself controller-written.If a freshly created MCA is reconciled before the addon manager has populated
Status.SupportedConfigs,addonSupportsDeploymentConfigreturns false and the function still falls through tonil, nil— i.e., the same default-namespace race the PR aims to close, just on a (likely smaller) timing window. In practiceSupportedConfigsis written very early from the CMA template, so this is usually fine; flagging for awareness rather than as a blocker. If you want full coverage, an alternate signal would be checking the CMA'ssupportedConfigsdirectly, or treating "MCA exists butSupportedConfigsempty AND no terminal Configured condition" as a retry case.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/utils/addon_config.go` around lines 142 - 149, The guard addonSupportsDeploymentConfig currently reads addon.Status.SupportedConfigs which may be empty before the addon-controller populates it, causing a false negative and leaving callers to fall through to the default-namespace race; update the check in addonSupportsDeploymentConfig (or its caller) to treat an empty Status.SupportedConfigs as indeterminate: either read the source CR's supportedConfigs field if available (the ManagedClusterAddOn spec/template value) or return a retry signal when Status.SupportedConfigs is empty and there is no terminal Configured condition on the ManagedClusterAddOn, so the reconciler requeues instead of assuming “false.” Ensure you reference the ManagedClusterAddOn object (addon) and its Status.SupportedConfigs and Configured condition when implementing this change.pkg/utils/addon_config_test.go (1)
308-461: 💤 Low valueLGTM —
TestGetDesiredAddOnDeploymentConfigmirrors the new branch coverage.Six cases covering: no support / no ref, support+no Configured, support+Configured=False, support+Configured=True (authoritative nil), valid spec hash, empty spec hash. Good direct unit coverage of the helper rather than relying solely on the wrapper test.
One small optional nit: the assertions could be tightened by also checking the error message for the retry-path cases (e.g.,
assert.Contains(t, err.Error(), "Configured condition is not True")) so a future refactor that returns a different error from the same code path doesn't silently still pass. Not a blocker.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/utils/addon_config_test.go` around lines 308 - 461, TestGetDesiredAddOnDeploymentConfig currently checks only error presence for retry-path cases; tighten the assertions by also validating the error message content for those cases (e.g., when calling GetDesiredAddOnDeploymentConfig from the test and expectError==true) so the retry-path returns the expected reason string. Update the test loop in TestGetDesiredAddOnDeploymentConfig to, after assert.Error(t, err), call assert.Contains(t, err.Error(), "Configured condition is not True") (or the exact message emitted by GetDesiredAddOnDeploymentConfig) for the cases that model the configured-not-true/retry scenarios; keep existing checks for expectNil/NotNil unchanged. Use the function/test names GetDesiredAddOnDeploymentConfig and TestGetDesiredAddOnDeploymentConfig to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/utils/addon_config_test.go`:
- Around line 308-461: TestGetDesiredAddOnDeploymentConfig currently checks only
error presence for retry-path cases; tighten the assertions by also validating
the error message content for those cases (e.g., when calling
GetDesiredAddOnDeploymentConfig from the test and expectError==true) so the
retry-path returns the expected reason string. Update the test loop in
TestGetDesiredAddOnDeploymentConfig to, after assert.Error(t, err), call
assert.Contains(t, err.Error(), "Configured condition is not True") (or the
exact message emitted by GetDesiredAddOnDeploymentConfig) for the cases that
model the configured-not-true/retry scenarios; keep existing checks for
expectNil/NotNil unchanged. Use the function/test names
GetDesiredAddOnDeploymentConfig and TestGetDesiredAddOnDeploymentConfig to
locate the code.
In `@pkg/utils/addon_config.go`:
- Around line 142-149: The guard addonSupportsDeploymentConfig currently reads
addon.Status.SupportedConfigs which may be empty before the addon-controller
populates it, causing a false negative and leaving callers to fall through to
the default-namespace race; update the check in addonSupportsDeploymentConfig
(or its caller) to treat an empty Status.SupportedConfigs as indeterminate:
either read the source CR's supportedConfigs field if available (the
ManagedClusterAddOn spec/template value) or return a retry signal when
Status.SupportedConfigs is empty and there is no terminal Configured condition
on the ManagedClusterAddOn, so the reconciler requeues instead of assuming
“false.” Ensure you reference the ManagedClusterAddOn object (addon) and its
Status.SupportedConfigs and Configured condition when implementing this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f27e1f9e-014d-4330-92e1-f287490b074d
📒 Files selected for processing (2)
pkg/utils/addon_config.gopkg/utils/addon_config_test.go
|
/retest |
|
@tesshuflower: Cannot trigger testing until a trusted user reviews the PR and leaves an DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
@tesshuflower if the retrigger bot is not working, just let me know (here or DM if I missed the email). I can retry it. Hopefully when you are in the org, you will be able to do this. |
Thanks Mike! I've been trying to run the e2e locally to dig into it a bit - my change in this PR could cause the addon to take longer to reconcile as it needs to wait for the config controller to set the status - but I think it should work in general, even in this particular e2e test. |
|
@qiujian16 I'm still looking for a review on this one, to verify that the approach of using the status conditions "Configured" set to True as the indicator that the config controller is done setting the configReferences - and therefore that we can trust the value there. |
| } | ||
|
|
||
| func addonSupportsDeploymentConfig(addon *addonapiv1beta1.ManagedClusterAddOn) bool { | ||
| for _, sc := range addon.Status.SupportedConfigs { |
There was a problem hiding this comment.
addon.Status.SupportedConfigs is for addon users to know to supported config type. In code level, https://github.com/open-cluster-management-io/addon-framework/blob/main/pkg/addonmanager/controllers/addonconfig/controller.go#L177 , should check the configGVRs.
There was a problem hiding this comment.
Thanks for this @haoqing0110 - I guess I was having trouble seeing why the configGVRs would get loaded in this code - but using Status.SupportedConfigs has a similar timing issue as it's something else that's setting that status.
I think in the end, are you ok if we use configured=True condition to decide that the configReferences are ready to be consumed? It seems that this is the correct condition to check.
However, there is the one caveat that where a ClusterManagementAddOn may have this annotation: addon.open-cluster-management.io/lifecycle: "self" - in this case, the external controller does not ever set configured=True condition.
Do you think this is a concern? I think with v1beta1 this is less of a concern since there is no such idea of "supportedConfigs" anymore - and if the annotation is set by a user, do we not expect them to call AgentInstallNamespaceFromDeploymentConfigFunc() at all?
There was a problem hiding this comment.
@tesshuflower I think it's good to check the condition configured=True. And after migrating to v1beta1, the code won't have the annotation any more. open-cluster-management-io/ocm#1428
| // or hasn't finished rolling out configs. In either case configReferences may be | ||
| // incomplete. Return an error so callers retry rather than proceeding with no config. | ||
| if addonSupportsDeploymentConfig(addon) && !addonConfiguredTrue(addon) { | ||
| return nil, fmt.Errorf("addon %s supports addondeploymentconfigs but Configured condition is not True yet, need to retry", |
There was a problem hiding this comment.
can we have an integration test on this? Also if it returns err, the controller will backoff upon error. Instead of returning an error, should we return a state that "nothing is configured yet" so the caller will know nothing should be handled, and when addon status is updated, this func will be triggered again.
There was a problem hiding this comment.
Yes, this is a concern I mentioned in my description as well - the issue is the function AgentInstallNamespaceFromDeploymentConfigFunc() is essentially a helper function provided by the addon-framework that addons are already using to get the namespace from the deploymentconfig - Should I consider deprecating it and making a new function instead that can return the extra status?
There was a problem hiding this comment.
@qiujian16 What do you think about my comment above? My main concern with changing the function signature is this is a function already called by users of the addon-framework - If I want to return another parameter to tell them to retry, this necessitates a change in the function or perhaps a new one. Do you have a preference here?
There was a problem hiding this comment.
@qiujian16 @haoqing0110 I've come back to this one, and was going through changes required to add new functions to be able to get the namespace (with retries) - however there are a lot of cascading affected places.
For example addon framework interface functions would need updating too, example:
addon-framework/pkg/agent/interface.go
Line 73 in 929daf0
And then even from getValues functions like here:
Essentially all of those don't currently have a strict retry mechanism and would need to change or get updated.
But then I found we actually have this ConfigCheckEnabled setting already:
addon-framework/pkg/agent/interface.go
Line 112 in 929daf0
And this is used in a couple places before calling the agent Manifests() function.
Do you think perhaps I should pivot to a fix that simply uses ConfigCheckEnabled instead? Ultimately it's the same exact guard - it checks for configured=true in the status before proceeding. I feel like my current fix in this PR is maybe bypassing the `ConfigCheckEnabled as well as getting overly complicated.
Essentially a fix would be to make sure we do something similar to this:
Anytime before the getAgentNamespace (of even agent Manifests()) is called. This would essentially give us the same protection, but would only be turned on when users set ConfigCheckEnabled.
Sorry the above is very long, I've made an alternative draft PR here to demonstrate the changes required with this approach: #382
The comments on the original function with the race condition attempt to explain it as well: https://github.com/open-cluster-management-io/addon-framework/pull/382/changes#diff-0268a17aba5249e6bd421172ca6ac3c9c0e93ef819030b1cba41afbe50d89743
Summary
Fixes a race condition where addon controllers use the default namespace (open-cluster-management-agent-addon) before AddonDeploymentConfig has been synced to MCA status by the addonconfiguration controller.
When the registration controller (addon-framework) reconciles an MCA before the addonconfiguration controller (OCM addon-manager) has written configReferences, GetDesiredAddOnDeploymentConfig returns nil, nil — meaning "no config exists." Callers fall back to the default namespace, which is wrong if an AddonDeploymentConfig with a custom
namespace is actually configured.
The fix checks two signals in MCA status before accepting an empty configReferences as authoritative:
If the addon supports deployment configs but Configured is not yet True, an error is returned so callers requeue instead of proceeding with stale data. The check is in GetDesiredAddOnDeploymentConfig to protect all three callers: AgentInstallNamespaceFromDeploymentConfigFunc, GetAddOnDeploymentConfigValues, and GetAgentImageValues.
Things to note:
Configuredand checking that it exists and is True. This means potentially we keep reconciling until this happens. Is there any potential issue here? It seems the configurationcontroller updates the status.configReferences and adds the Configured condition at the same time.addon-framework/pkg/addonmanager/controllers/registration/controller.go
Lines 278 to 281 in bbaaa1e
Related issue(s)
Fixes #
open-cluster-management-io/ocm#1465
Summary by CodeRabbit
Bug Fixes
Tests