🐛 addon fix for Install namespace potential race condition with addondeploymentconfig - For issue: https://github.com/open-cluster-management-io/ocm/issues/1465 #381

Open
tesshuflower wants to merge 2 commits into open-cluster-management-io:main from tesshuflower:installns-addondeployconfig-race-cond-ocm1465

Conversation

@tesshuflower

@tesshuflower tesshuflower commented May 1, 2026

Contributor

Summary

Fixes a race condition where addon controllers use the default namespace (open-cluster-management-agent-addon) before AddonDeploymentConfig has been synced to MCA status by the addonconfiguration controller.

When the registration controller (addon-framework) reconciles an MCA before the addonconfiguration controller (OCM addon-manager) has written configReferences, GetDesiredAddOnDeploymentConfig returns nil, nil, meaning "no config exists." Callers fall back to the default namespace, which is wrong if an AddonDeploymentConfig with a custom namespace is actually configured.

The fix checks two signals in MCA status before accepting an empty configReferences as authoritative:

  • supportedConfigs — does the addon declare support for addondeploymentconfigs?
  • Configured condition — has the addonconfiguration controller set it to True?

If the addon supports deployment configs but Configured is not yet True, an error is returned so callers requeue instead of proceeding with stale data. The check is in GetDesiredAddOnDeploymentConfig to protect all three callers: AgentInstallNamespaceFromDeploymentConfigFunc, GetAddOnDeploymentConfigValues, and GetAgentImageValues.
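
A minimal sketch of the guard described above, for readers who want to see the decision table in one place. The types here are simplified stand-ins, not the real OCM API structs, and `desiredConfig` and `errNotConfiguredYet` are hypothetical names; only the decision logic follows the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition and AddOnStatus are simplified, hypothetical stand-ins for the
// ManagedClusterAddOn status fields discussed in this PR.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

type AddOnStatus struct {
	SupportedConfigs []string // resource names, e.g. "addondeploymentconfigs"
	ConfigReferences []string // names of the referenced configs
	Conditions       []Condition
}

// errNotConfiguredYet tells the caller to requeue instead of defaulting.
var errNotConfiguredYet = errors.New("configured condition is not True yet, need to retry")

func supportsDeploymentConfig(s AddOnStatus) bool {
	for _, sc := range s.SupportedConfigs {
		if sc == "addondeploymentconfigs" {
			return true
		}
	}
	return false
}

func configuredTrue(s AddOnStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Configured" {
			return c.Status == "True"
		}
	}
	return false
}

// desiredConfig returns the first config reference, "" when the absence of a
// reference is authoritative, or an error when the status cannot be trusted yet.
func desiredConfig(s AddOnStatus) (string, error) {
	if len(s.ConfigReferences) > 0 {
		return s.ConfigReferences[0], nil
	}
	if supportsDeploymentConfig(s) && !configuredTrue(s) {
		return "", errNotConfiguredYet
	}
	return "", nil
}

func main() {
	pending := AddOnStatus{SupportedConfigs: []string{"addondeploymentconfigs"}}
	ref, err := desiredConfig(pending)
	fmt.Printf("pending: ref=%q err=%v\n", ref, err)

	settled := pending
	settled.Conditions = []Condition{{Type: "Configured", Status: "True"}}
	ref, err = desiredConfig(settled)
	fmt.Printf("settled: ref=%q err=%v\n", ref, err)
}
```

In this sketch the empty reference list is only accepted once Configured is True or the addon does not declare support, which mirrors the two signals listed above.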

Things to note:

  • This returns a new error when we determine that the addonconfiguration controller has not yet updated the status. The check was put into GetDesiredAddOnDeploymentConfig() rather than directly into AgentInstallNamespaceFromDeploymentConfigFunc, which means it affects more callers, as indicated above.
  • To determine whether configReferences has been populated yet, we check that the Configured condition in .status.conditions exists and is True. This means we potentially keep reconciling until that happens. Is there any potential issue here? It seems the addonconfiguration controller updates status.configReferences and adds the Configured condition at the same time.
  • Additionally, this requires the caller to retry on error. This does generally happen here:
    ns, err := registrationOption.AgentInstallNamespace(managedClusterAddonCopy)
    if err != nil {
        return err
    }
    However, other callers could choose to ignore the error, I suppose.

Related issue(s)

Fixes #

open-cluster-management-io/ocm#1465

Summary by CodeRabbit

  • Bug Fixes

    • Improved addon deployment configuration handling: the app now distinguishes missing configuration from unsupported addons, triggers retries when an addon supports deployment config but isn't yet configured, and continues gracefully for unsupported or already-configured cases.
  • Tests

    • Added comprehensive tests covering supported/unsupported addon scenarios, missing or false configured conditions, and various config-reference states to validate retry and defaulting behavior.

@openshift-ci

openshift-ci Bot commented May 1, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tesshuflower
Once this PR has been reviewed and has the lgtm label, please assign qiujian16 for approval. For more information see the Code Review Process.


@openshift-ci openshift-ci Bot requested review from qiujian16 and zhiweiyin318 May 1, 2026 13:51
@coderabbitai

coderabbitai Bot commented May 1, 2026


Walkthrough

GetDesiredAddOnDeploymentConfig now distinguishes a missing ConfigReference from a pending addon rollout by checking whether the addon supports AddOnDeploymentConfig and whether its ManagedClusterAddOnConditionConfigured condition is True; it returns a retryable error when support exists but Configured is not True. Two unexported helpers were added; tests expanded.

Changes

Deployment config resolution and tests

| Layer / File(s) | Summary |
|---|---|
| Core behavior change: pkg/utils/addon_config.go | When no ConfigReference exists, the function now checks addon support and the Configured condition and returns an error to trigger retry if the addon supports deployment configs but is not Configured=true. |
| Helpers: pkg/utils/addon_config.go | Added addonSupportsDeploymentConfig(addon *ManagedClusterAddOn) bool and addonConfiguredTrue(addon *ManagedClusterAddOn) bool to inspect Status.SupportedConfigs and ManagedClusterAddOnConditionConfigured respectively. |
| Imports: pkg/utils/addon_config.go | Imported k8s.io/apimachinery/pkg/api/meta to read status conditions. |
| Tests - existing test table updates: pkg/utils/addon_config_test.go | TestAgentInstallNamespaceFromDeploymentConfigFunc table gains expectError and new cases covering: addon nil, support present but Configured missing/false (requeue), Configured=true defaulting paths, and addon not supporting deployment config. |
| Tests - new unit: pkg/utils/addon_config_test.go | Added TestGetDesiredAddOnDeploymentConfig covering: missing config ref under varying support/Configured states, config refs with valid spec hash (success), and empty spec hash (error). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, approved

Suggested reviewers

  • zhiweiyin318
  • haoqing0110

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main fix, handling a race condition in addon namespace selection with deployment configs, and references the related issue. |
| Description check | ✅ Passed | The description includes a comprehensive summary of the race condition, the fix approach, and key implementation notes addressing potential concerns. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


@tesshuflower tesshuflower changed the title from "🐛 addon fix for Install namespace potential race condition with addondeploymentconfig" to "🐛 addon fix for Install namespace potential race condition with addondeploymentconfig - For issue: https://github.com/open-cluster-management-io/ocm/issues/1465" on May 1, 2026

@qiujian16 qiujian16 left a comment

Member

/assign @haoqing0110

tesshuflower and others added 2 commits May 6, 2026 14:05
…CA status

Check in AgentInstallNamespaceFromDeploymentConfigFunc: if the addon
declares support for addondeploymentconfigs but the Configured condition
is not yet True, return an error so callers requeue instead of
proceeding with the default namespace.

Addresses: open-cluster-management-io/ocm#1465

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tesshu Flower <tflower@redhat.com>
…mentConfig

Move the Configured condition check from AgentInstallNamespaceFromDeploymentConfigFunc
into GetDesiredAddOnDeploymentConfig so all three callers (install namespace,
helm values, image overrides) are protected from the race.

Add dedicated TestGetDesiredAddOnDeploymentConfig covering the new guard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tesshu Flower <tflower@redhat.com>
@tesshuflower tesshuflower force-pushed the installns-addondeployconfig-race-cond-ocm1465 branch from b1ab45f to 6c7647d Compare May 6, 2026 18:13

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
pkg/utils/addon_config.go (1)

142-149: 💤 Low value

Heads up: this guard depends on Status.SupportedConfigs being populated, which is itself controller-written.

If a freshly created MCA is reconciled before the addon manager has populated Status.SupportedConfigs, addonSupportsDeploymentConfig returns false and the function still falls through to nil, nil — i.e., the same default-namespace race the PR aims to close, just on a (likely smaller) timing window. In practice SupportedConfigs is written very early from the CMA template, so this is usually fine; flagging for awareness rather than as a blocker. If you want full coverage, an alternate signal would be checking the CMA's supportedConfigs directly, or treating "MCA exists but SupportedConfigs empty AND no terminal Configured condition" as a retry case.
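
One possible reading of the "treat empty SupportedConfigs as indeterminate" suggestion above, sketched with hypothetical names and a simplified status type. It assumes that Configured=True is the only state in which an empty reference list is trusted; nothing here is taken from the actual addon-framework code.

```go
package main

import "fmt"

// Readiness is a hypothetical three-way answer to "can an empty
// configReferences list be trusted?".
type Readiness int

const (
	Unsupported   Readiness = iota // addon does not use deployment configs
	Ready                          // status is settled, empty means "no config"
	Indeterminate                  // status not settled yet, caller should retry
)

func (r Readiness) String() string {
	return [...]string{"Unsupported", "Ready", "Indeterminate"}[r]
}

// Status is a simplified stand-in for the relevant ManagedClusterAddOn status.
type Status struct {
	SupportedConfigs []string
	Configured       string // "", "True" or "False"
}

func classify(s Status) Readiness {
	if s.Configured == "True" {
		return Ready
	}
	// Nothing written yet: we cannot tell "unsupported" from "not populated".
	if len(s.SupportedConfigs) == 0 {
		return Indeterminate
	}
	for _, sc := range s.SupportedConfigs {
		if sc == "addondeploymentconfigs" {
			return Indeterminate
		}
	}
	return Unsupported
}

func main() {
	fmt.Println(classify(Status{}))
	fmt.Println(classify(Status{SupportedConfigs: []string{"addondeploymentconfigs"}}))
	fmt.Println(classify(Status{SupportedConfigs: []string{"other"}}))
	fmt.Println(classify(Status{Configured: "True"}))
}
```

The trade-off in this sketch is that an addon with a legitimately empty SupportedConfigs list is retried until Configured becomes True, so whether it is acceptable depends on whether the controller sets that condition for such addons.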

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/utils/addon_config.go` around lines 142 - 149, The guard
addonSupportsDeploymentConfig currently reads addon.Status.SupportedConfigs
which may be empty before the addon-controller populates it, causing a false
negative and leaving callers to fall through to the default-namespace race;
update the check in addonSupportsDeploymentConfig (or its caller) to treat an
empty Status.SupportedConfigs as indeterminate: either read the source CR's
supportedConfigs field if available (the ManagedClusterAddOn spec/template
value) or return a retry signal when Status.SupportedConfigs is empty and there
is no terminal Configured condition on the ManagedClusterAddOn, so the
reconciler requeues instead of assuming “false.” Ensure you reference the
ManagedClusterAddOn object (addon) and its Status.SupportedConfigs and
Configured condition when implementing this change.
pkg/utils/addon_config_test.go (1)

308-461: 💤 Low value

LGTM — TestGetDesiredAddOnDeploymentConfig mirrors the new branch coverage.

Six cases covering: no support / no ref, support+no Configured, support+Configured=False, support+Configured=True (authoritative nil), valid spec hash, empty spec hash. Good direct unit coverage of the helper rather than relying solely on the wrapper test.

One small optional nit: the assertions could be tightened by also checking the error message for the retry-path cases (e.g., assert.Contains(t, err.Error(), "Configured condition is not True")) so a future refactor that returns a different error from the same code path doesn't silently still pass. Not a blocker.
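
A small sketch of the tightened assertion suggested above, using only the standard library instead of testify so it runs on its own. `getConfig` and `checkCase` are hypothetical stand-ins; the error text is the one shown in the diff under review.

```go
package main

import (
	"fmt"
	"strings"
)

// getConfig is a hypothetical stand-in for the function under test.
func getConfig(supported, configured bool) (string, error) {
	if supported && !configured {
		return "", fmt.Errorf("addon %s supports addondeploymentconfigs but Configured condition is not True yet, need to retry", "test-addon")
	}
	return "", nil
}

// checkCase returns a failure description, or "" when the error matches the
// expectation. An empty wantErrSubstr means "no error expected".
func checkCase(err error, wantErrSubstr string) string {
	switch {
	case wantErrSubstr == "" && err != nil:
		return "unexpected error: " + err.Error()
	case wantErrSubstr != "" && err == nil:
		return "expected an error containing: " + wantErrSubstr
	case wantErrSubstr != "" && !strings.Contains(err.Error(), wantErrSubstr):
		return "error does not contain: " + wantErrSubstr
	}
	return ""
}

func main() {
	cases := []struct {
		name                  string
		supported, configured bool
		wantErrSubstr         string
	}{
		{"retry path", true, false, "Configured condition is not True"},
		{"configured", true, true, ""},
		{"unsupported", false, false, ""},
	}
	for _, c := range cases {
		_, err := getConfig(c.supported, c.configured)
		fmt.Printf("%s: failure=%q\n", c.name, checkCase(err, c.wantErrSubstr))
	}
}
```

Matching on a substring keeps the test tied to the retry reason, so a different error returned from the same path would fail the case rather than pass silently.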

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/utils/addon_config_test.go` around lines 308 - 461,
TestGetDesiredAddOnDeploymentConfig currently checks only error presence for
retry-path cases; tighten the assertions by also validating the error message
content for those cases (e.g., when calling GetDesiredAddOnDeploymentConfig from
the test and expectError==true) so the retry-path returns the expected reason
string. Update the test loop in TestGetDesiredAddOnDeploymentConfig to, after
assert.Error(t, err), call assert.Contains(t, err.Error(), "Configured condition
is not True") (or the exact message emitted by GetDesiredAddOnDeploymentConfig)
for the cases that model the configured-not-true/retry scenarios; keep existing
checks for expectNil/NotNil unchanged. Use the function/test names
GetDesiredAddOnDeploymentConfig and TestGetDesiredAddOnDeploymentConfig to
locate the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f27e1f9e-014d-4330-92e1-f287490b074d

📥 Commits

Reviewing files that changed from the base of the PR and between b1ab45f and 6c7647d.

📒 Files selected for processing (2)
  • pkg/utils/addon_config.go
  • pkg/utils/addon_config_test.go

@tesshuflower

Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented May 6, 2026

Contributor

@tesshuflower: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.


@mikeshng

mikeshng commented May 6, 2026

Member

/ok-to-test

@mikeshng

mikeshng commented May 6, 2026

Member

@tesshuflower if the retrigger bot is not working, just let me know (here or DM if I missed the email). I can retry it. Hopefully when you are in the org, you will be able to do this.

@tesshuflower

Contributor Author

> @tesshuflower if the retrigger bot is not working, just let me know (here or DM if I missed the email). I can retry it. Hopefully when you are in the org, you will be able to do this.

Thanks Mike! I've been trying to run the e2e locally to dig into it a bit - my change in this PR could cause the addon to take longer to reconcile as it needs to wait for the config controller to set the status - but I think it should work in general, even in this particular e2e test.

@tesshuflower

Contributor Author

@qiujian16 I'm still looking for a review on this one, to verify the approach: using the status condition "Configured" set to True as the indicator that the config controller is done setting the configReferences, and therefore that we can trust the value there.

Comment thread pkg/utils/addon_config.go
}

func addonSupportsDeploymentConfig(addon *addonapiv1beta1.ManagedClusterAddOn) bool {
for _, sc := range addon.Status.SupportedConfigs {

Member

addon.Status.SupportedConfigs is for addon users to know the supported config types. At the code level, the check should use the configGVRs, as in https://github.com/open-cluster-management-io/addon-framework/blob/main/pkg/addonmanager/controllers/addonconfig/controller.go#L177.

Contributor Author

Thanks for this @haoqing0110 - I guess I was having trouble seeing why the configGVRs would get loaded in this code - but using Status.SupportedConfigs has a similar timing issue as it's something else that's setting that status.

I think in the end, are you ok if we use configured=True condition to decide that the configReferences are ready to be consumed? It seems that this is the correct condition to check.

However, there is one caveat: a ClusterManagementAddOn may have the annotation addon.open-cluster-management.io/lifecycle: "self", in which case the external controller never sets the configured=True condition.

Do you think this is a concern? I think with v1beta1 this is less of a concern since there is no such idea of "supportedConfigs" anymore - and if the annotation is set by a user, do we not expect them to call AgentInstallNamespaceFromDeploymentConfigFunc() at all?

Member

@tesshuflower I think it's good to check the condition configured=True. And after migrating to v1beta1, the code won't have the annotation any more. open-cluster-management-io/ocm#1428

Comment thread pkg/utils/addon_config.go
// or hasn't finished rolling out configs. In either case configReferences may be
// incomplete. Return an error so callers retry rather than proceeding with no config.
if addonSupportsDeploymentConfig(addon) && !addonConfiguredTrue(addon) {
return nil, fmt.Errorf("addon %s supports addondeploymentconfigs but Configured condition is not True yet, need to retry",

Member

Can we have an integration test for this? Also, if it returns an error, the controller will back off. Instead of returning an error, should we return a state meaning "nothing is configured yet", so the caller knows nothing should be handled? When the addon status is updated, this func will be triggered again.
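
To make the "return a state instead of an error" idea above concrete, here is a rough sketch. The function names, the `ready` return value and the status type are all hypothetical; the real helper has a different signature, which is exactly the compatibility concern raised in the replies that follow.

```go
package main

import "fmt"

// defaultNamespace is the default agent namespace named in the PR description.
const defaultNamespace = "open-cluster-management-agent-addon"

// addonStatus is a simplified, hypothetical view of the addon status.
type addonStatus struct {
	SupportsDeploymentConfig bool
	Configured               bool
	ConfiguredNamespace      string // empty when no config sets a namespace
}

// installNamespace reports ready=false when the status is not settled yet,
// instead of returning an error that would trigger controller backoff.
func installNamespace(s addonStatus) (ns string, ready bool) {
	if s.SupportsDeploymentConfig && !s.Configured {
		return "", false
	}
	if s.ConfiguredNamespace != "" {
		return s.ConfiguredNamespace, true
	}
	return defaultNamespace, true
}

// reconcile does nothing when the namespace is not ready and relies on the
// next status update to trigger another reconcile.
func reconcile(s addonStatus) (string, error) {
	ns, ready := installNamespace(s)
	if !ready {
		return "skipped: waiting for Configured=True", nil
	}
	return "deploying agent into " + ns, nil
}

func main() {
	for _, s := range []addonStatus{
		{SupportsDeploymentConfig: true},
		{SupportsDeploymentConfig: true, Configured: true, ConfiguredNamespace: "custom-ns"},
		{},
	} {
		msg, err := reconcile(s)
		fmt.Println(msg, err)
	}
}
```

The design choice in this sketch is that the "not ready" path returns a nil error, so no backoff is applied and progress depends on the status update event arriving.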

Contributor Author

Yes, this is a concern I mentioned in my description as well - the issue is the function AgentInstallNamespaceFromDeploymentConfigFunc() is essentially a helper function provided by the addon-framework that addons are already using to get the namespace from the deploymentconfig - Should I consider deprecating it and making a new function instead that can return the extra status?

Contributor Author

@qiujian16 What do you think about my comment above? My main concern with changing the function signature is this is a function already called by users of the addon-framework - If I want to return another parameter to tell them to retry, this necessitates a change in the function or perhaps a new one. Do you have a preference here?

Member

yes I think it makes sense.

Contributor Author

@qiujian16 @haoqing0110 I've come back to this one and was going through the changes required to add new functions for getting the namespace (with retries). However, the change cascades into a lot of places.

For example addon framework interface functions would need updating too, example:

AgentInstallNamespace AgentInstallNamespaceFunc

And then even from getValues functions like here:

func (a *HelmAgentAddon) getValues(

Essentially all of those don't currently have a strict retry mechanism and would need to change or get updated.

But then I found we actually have this ConfigCheckEnabled setting already:

ConfigCheckEnabled bool

And this is used in a couple places before calling the agent Manifests() function.

Do you think perhaps I should pivot to a fix that simply uses ConfigCheckEnabled instead? Ultimately it's the same exact guard: it checks for configured=true in the status before proceeding. I feel like my current fix in this PR is maybe bypassing ConfigCheckEnabled as well as getting overly complicated.

Essentially a fix would be to make sure we do something similar to this:

https://github.com/open-cluster-management-io/addon-framework/blob/main/pkg/addonmanager/controllers/agentdeploy/controller.go#L409-L413

Anytime before getAgentNamespace (or even the agent Manifests()) is called. This would essentially give us the same protection, but it would only be turned on when users set ConfigCheckEnabled.

Sorry the above is very long, I've made an alternative draft PR here to demonstrate the changes required with this approach: #382

The comments on the original function with the race condition attempt to explain it as well: https://github.com/open-cluster-management-io/addon-framework/pull/382/changes#diff-0268a17aba5249e6bd421172ca6ac3c9c0e93ef819030b1cba41afbe50d89743
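
For readers following along, the ConfigCheckEnabled pivot described in this comment could look roughly like the sketch below. Only the ConfigCheckEnabled field name and the "check configured=true before proceeding" behavior come from the discussion; the types, `readyForNamespaceLookup` and `resolveNamespace` are hypothetical and are not the addon-framework implementation.

```go
package main

import "fmt"

// agentOptions is a hypothetical stand-in holding the opt-in flag.
type agentOptions struct {
	ConfigCheckEnabled bool
}

// addonState is a simplified view of the addon for this sketch.
type addonState struct {
	Name       string
	Configured bool // true when the Configured condition is True
	Namespace  string
}

// readyForNamespaceLookup gates the namespace (or manifests) lookup. With the
// flag off, behavior is unchanged; with it on, Configured must be True first.
func readyForNamespaceLookup(opts agentOptions, a addonState) bool {
	return !opts.ConfigCheckEnabled || a.Configured
}

// resolveNamespace only consults the addon once the gate allows it.
func resolveNamespace(opts agentOptions, a addonState) (string, bool) {
	if !readyForNamespaceLookup(opts, a) {
		return "", false
	}
	return a.Namespace, true
}

func main() {
	a := addonState{Name: "example-addon", Namespace: "custom-ns"}

	ns, ok := resolveNamespace(agentOptions{ConfigCheckEnabled: true}, a)
	fmt.Printf("check enabled, not configured: ns=%q ok=%v\n", ns, ok)

	a.Configured = true
	ns, ok = resolveNamespace(agentOptions{ConfigCheckEnabled: true}, a)
	fmt.Printf("check enabled, configured: ns=%q ok=%v\n", ns, ok)
}
```

As noted in the comment, protection in this shape only applies when the flag is set, so addons that leave it off keep the existing behavior.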

4 participants