OCPBUGS-84150: MachineOSConfig "build was not found" error#5878
OCPBUGS-84150: MachineOSConfig "build was not found" error#5878dkhater-redhat wants to merge 5 commits intoopenshift:mainfrom
Conversation
When a MachineOSConfig's current-machine-os-build annotation points to: 1. A build for an old rendered config (annotation is stale), OR 2. A build that no longer exists (deleted/pruned) The node controller would log warnings but not clear the annotation. This prevented the build controller from creating a new build for the current rendered config, causing pools to get stuck at 0 machines configured with the error 'Image is not ready yet'. Fix: actively clear the stale annotation when detected. This allows the build controller to see the annotation is missing and trigger creation of a new MachineOSBuild for the current rendered config. Fixes: OCPBUGS-XXXXX - MachineOSConfig 'build was not found' error
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@dkhater-redhat: This pull request references Jira Issue OCPBUGS-84150, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
WalkthroughNode controller now detects when a Changes
Sequence Diagram(s)sequenceDiagram
participant NodeController
participant API_Server
participant Reconciler
participant BuildStore
NodeController->>API_Server: Read MachineOSConfig (has current-machine-os-build)
API_Server-->>NodeController: Return resource
NodeController->>BuildStore: Lookup referenced MachineOSBuild
alt Build missing or mismatched
NodeController->>API_Server: Update MachineOSConfig (delete annotation)
API_Server-->>NodeController: Update result
NodeController->>Reconciler: Return error / signal missing annotation
Reconciler->>API_Server: Read MachineOSConfig (annotation empty)
API_Server-->>Reconciler: Return resource
Reconciler->>Reconciler: Set needsImageRebuild = true
Reconciler->>BuildStore: Create new MachineOSBuild
BuildStore-->>Reconciler: Build created
else Build exists and matches
NodeController-->>Reconciler: Continue with existing build
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dkhater-redhat The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Test validates that when a MachineOSConfig has a stale current-machine-os-build annotation (pointing to a build for an old rendered config or a non-existent build), the annotation is properly cleared to allow the build controller to create a new build. Test covers four scenarios: 1. Stale annotation - build exists but for old rendered config 2. Stale annotation - build does not exist 3. Valid annotation - build exists for current rendered config 4. No annotation - fallback to matching build
2109ca1 to
7b612b4
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controller/node/node_controller.go (1)
1178-1208:⚠️ Potential issue | 🟠 MajorGetter now mutates cluster state without retry-on-conflict or fresh API fetch.
Several concerns about this block:
Unretried write on a cached object.
ourConfigoriginates fromctrl.moscLister, soconfigCopycarries a cache-localResourceVersion. A concurrent update by the build reconciler (e.g., atpkg/controller/build/reconciler.go:339) will trigger a 409Conflict. Unlike Node mutations in this same file (updateCandidateNode,setUpdateInProgressTaint— lines 1478, 1799, 1831), which all useclientretry.RetryOnConflict, this MOSC write fails immediately. Even the build reconciler re-fetches via lister before its MOSC Update. Re-Get from the API and wrap withRetryOnConflictto handle concurrent writers.Read path now mutates.
getConfigAndBuildis called in a hot read path (getConfigAndBuildAndLayeredStatus→syncMachineConfigPool); any Update failure prevents the caller from seeing the mosc/mosb. Consider returning an additionalstaleAnnotationsignal and lettingsyncMachineConfigPool(or a dedicated helper) perform the clear, so lookup failures and clear failures are clearly distinguishable.Convoluted flag.
staleAnnotationis set in two branches and then re-checked via!staleAnnotationto detect "build name not found at all." AfoundBuildlocal and a single post-check is clearer.
context.TODO()with no timeout. A deadline-bounded context derived from the sync would prevent a slow API server from stalling the worker.Returned
ourConfigstill has the old annotation in memory. After a successful Update you return the unmodifiedourConfig, notconfigCopy. Harmless for current callers (they don't re-persist it), but easy to miss if a future caller reads the annotation again on the same pass.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/node/node_controller.go` around lines 1178 - 1208, getConfigAndBuild currently clears a stale annotation by updating a cache-backed ourConfig (from ctrl.moscLister) with context.TODO() and no retry, which can cause 409s, blocks, and returns the old object; change the logic so you detect a stale annotation but do not perform an immediate write on the lister object: instead re-get a fresh MachineOSConfig from the API, perform the delete on that copy, and update using client.MachineconfigurationV1().MachineOSConfigs().Update inside clientretry.RetryOnConflict with a context derived from the caller (add a timeout/deadline), and after a successful update return the updated config (or signal the staleAnnotation to syncMachineConfigPool to perform the clear); refer to getConfigAndBuild, ourConfig, configCopy, ctrl.client.MachineconfigurationV1().MachineOSConfigs().Update, ctrl.moscLister and syncMachineConfigPool when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/build/osbuildcontroller_test.go`:
- Around line 895-1003: The test TestOSBuildControllerClearsStaleAnnotation
currently only starts the OSBuildController so the node controller's
getConfigAndBuild path never runs; either move this test into the node
controller package and drive it via syncMachineConfigPool or modify the test to
start the node controller as well (use the same setup helper that instantiates
the node controller) so the stale-annotation clearing logic in getConfigAndBuild
executes; additionally, reorder the narrative so the comment about "pool has
moved on to rendered-worker-3" matches when rendered-worker-3 is applied,
replace time.Sleep calls with kubeassert.Eventually assertions (e.g., verify
MachineOSBuildExists/length) to avoid flakiness, and stop indexing
apiMosc.Annotations directly—set the stale annotation using
metav1.SetMetaDataAnnotation(&apiMosc.ObjectMeta,
constants.CurrentMachineOSBuildAnnotationKey, firstMosb.Name) before updating to
handle nil maps safely.
---
Outside diff comments:
In `@pkg/controller/node/node_controller.go`:
- Around line 1178-1208: getConfigAndBuild currently clears a stale annotation
by updating a cache-backed ourConfig (from ctrl.moscLister) with context.TODO()
and no retry, which can cause 409s, blocks, and returns the old object; change
the logic so you detect a stale annotation but do not perform an immediate write
on the lister object: instead re-get a fresh MachineOSConfig from the API,
perform the delete on that copy, and update using
client.MachineconfigurationV1().MachineOSConfigs().Update inside
clientretry.RetryOnConflict with a context derived from the caller (add a
timeout/deadline), and after a successful update return the updated config (or
signal the staleAnnotation to syncMachineConfigPool to perform the clear); refer
to getConfigAndBuild, ourConfig, configCopy,
ctrl.client.MachineconfigurationV1().MachineOSConfigs().Update, ctrl.moscLister
and syncMachineConfigPool when making these changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9f74fdc9-4f68-4809-8964-c584ef0575dc
📒 Files selected for processing (2)
pkg/controller/build/osbuildcontroller_test.gopkg/controller/node/node_controller.go
|
|
||
| // TestOSBuildControllerClearsStaleAnnotation validates that when a MachineOSConfig | ||
| // has a stale current-machine-os-build annotation, the node controller clears it | ||
| // and allows a new build to be created. This tests the fix for the bug where pools | ||
| // get stuck at 0 machines configured with "Image is not ready yet" errors. | ||
| // | ||
| // Test scenario: | ||
| // 1. Create initial successful build for rendered-worker-1 | ||
| // 2. Manually create a stale annotation pointing to the old build | ||
| // 3. Apply a layer-only MachineConfig change (e.g., SSH key) -> rendered-worker-2 | ||
| // 4. Verify stale annotation is cleared | ||
| // 5. Verify new build is created for rendered-worker-2 | ||
| func TestOSBuildControllerClearsStaleAnnotation(t *testing.T) { | ||
| ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) | ||
| t.Cleanup(cancel) | ||
| poolName := "worker" | ||
|
|
||
| // Setup: Create initial successful build | ||
| _, mcfgclient, _, _, mosc, firstMosb, mcp, _, kubeassert := setupOSBuildControllerForTestWithSuccessfulBuild(ctx, t, poolName) | ||
|
|
||
| // Verify first build exists and gets annotation | ||
| isMachineOSBuildReachedExpectedCount(ctx, t, mcfgclient, mosc, 1) | ||
| assertMachineOSConfigGetsCurrentBuildAnnotation(ctx, t, mcfgclient, mosc, firstMosb) | ||
|
|
||
| t.Logf("Initial build %s created successfully for %s", firstMosb.Name, mosc.Name) | ||
|
|
||
| // Now apply a layer-only change (SSH key addition, doesn't need image rebuild) | ||
| // This will create a new rendered config: rendered-worker-2 | ||
| insertNewRenderedMachineConfigWithoutImageChangeAndUpdatePool(ctx, t, mcfgclient, mcp.Name, "rendered-worker-2") | ||
|
|
||
| // Give the controller time to process | ||
| time.Sleep(300 * time.Millisecond) | ||
|
|
||
| // At this point, a new MOSB should be created for the layer-only change | ||
| // The annotation should point to the new build, not the old one | ||
| mosbList, err := mcfgclient.MachineconfigurationV1(). | ||
| MachineOSBuilds(). | ||
| List(ctx, metav1.ListOptions{LabelSelector: utils.MachineOSBuildForPoolSelector(mosc).String()}) | ||
| require.NoError(t, err) | ||
| require.Len(t, mosbList.Items, 2, "expected a new MOSB to be created for layering-only change") | ||
|
|
||
| // Find the new build (not the first one) | ||
| var secondMosb *mcfgv1.MachineOSBuild | ||
| for i := range mosbList.Items { | ||
| if mosbList.Items[i].Name != firstMosb.Name { | ||
| secondMosb = &mosbList.Items[i] | ||
| break | ||
| } | ||
| } | ||
| require.NotNil(t, secondMosb, "second MOSB should exist") | ||
| assert.Equal(t, "rendered-worker-2", secondMosb.Spec.MachineConfig.Name, "second MOSB should be for rendered-worker-2") | ||
|
|
||
| t.Logf("Second build %s created for new rendered config %s", secondMosb.Name, secondMosb.Spec.MachineConfig.Name) | ||
|
|
||
| // Now simulate the bug: Manually set a stale annotation pointing to the old build | ||
| // while the pool has moved on to rendered-worker-3 | ||
| apiMosc, err := mcfgclient.MachineconfigurationV1().MachineOSConfigs().Get(ctx, mosc.Name, metav1.GetOptions{}) | ||
| require.NoError(t, err) | ||
|
|
||
| // Set stale annotation pointing to first build | ||
| apiMosc.Annotations[constants.CurrentMachineOSBuildAnnotationKey] = firstMosb.Name | ||
| apiMosc, err = mcfgclient.MachineconfigurationV1().MachineOSConfigs().Update(ctx, apiMosc, metav1.UpdateOptions{}) | ||
| require.NoError(t, err) | ||
| t.Logf("Manually set stale annotation to %s", firstMosb.Name) | ||
|
|
||
| // Apply another layer-only change -> rendered-worker-3 | ||
| // This should trigger the stale annotation detection and cleanup | ||
| insertNewRenderedMachineConfigWithoutImageChangeAndUpdatePool(ctx, t, mcfgclient, mcp.Name, "rendered-worker-3") | ||
|
|
||
| // Give the controller time to detect stale annotation and create new build | ||
| time.Sleep(500 * time.Millisecond) | ||
|
|
||
| // Verify that: | ||
| // 1. The stale annotation was cleared or updated to point to a new build | ||
| // 2. A new MOSB was created for rendered-worker-3 | ||
| apiMosc, err = mcfgclient.MachineconfigurationV1().MachineOSConfigs().Get(ctx, mosc.Name, metav1.GetOptions{}) | ||
| require.NoError(t, err) | ||
|
|
||
| currentAnnotation := apiMosc.Annotations[constants.CurrentMachineOSBuildAnnotationKey] | ||
| assert.NotEqual(t, firstMosb.Name, currentAnnotation, | ||
| "stale annotation pointing to first build should be cleared/updated") | ||
|
|
||
| mosbList, err = mcfgclient.MachineconfigurationV1(). | ||
| MachineOSBuilds(). | ||
| List(ctx, metav1.ListOptions{LabelSelector: utils.MachineOSBuildForPoolSelector(mosc).String()}) | ||
| require.NoError(t, err) | ||
| require.Len(t, mosbList.Items, 3, "expected third MOSB to be created after stale annotation cleanup") | ||
|
|
||
| // Find the third build for rendered-worker-3 | ||
| var thirdMosb *mcfgv1.MachineOSBuild | ||
| for i := range mosbList.Items { | ||
| if mosbList.Items[i].Spec.MachineConfig.Name == "rendered-worker-3" { | ||
| thirdMosb = &mosbList.Items[i] | ||
| break | ||
| } | ||
| } | ||
| require.NotNil(t, thirdMosb, "third MOSB should exist for rendered-worker-3") | ||
| t.Logf("Third build %s created successfully after stale annotation cleanup", thirdMosb.Name) | ||
|
|
||
| // Verify job was NOT created for layer-only builds | ||
| for _, mosb := range mosbList.Items { | ||
| if mosb.Name != firstMosb.Name { | ||
| jobName := utils.GetBuildJobName(&mosb) | ||
| kubeassert.JobDoesNotExist(jobName, "layering-only MOSB %s should not spawn a build Job", mosb.Name) | ||
| } | ||
| } | ||
|
|
||
| t.Logf("Test passed: stale annotation was cleaned up and new builds were created correctly") | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm that startController only wires the OSBuildController and does not bring up the node controller
# (which would mean the new node-controller clearing logic runs in this test).
rg -nP -C3 'func startController\(' pkg/controller/build/osbuildcontroller_test.go
rg -nP -C3 'newOSBuildController\(' pkg/controller/build/
# And confirm no node.Controller is instantiated from the build test helpers.
rg -nP -C2 '"github.com/openshift/machine-config-operator/pkg/controller/node"' pkg/controller/build/Repository: openshift/machine-config-operator
Length of output: 2302
🏁 Script executed:
# 1. Verify stale-annotation clearing logic exists in node controller
rg -nP 'getConfigAndBuild|CurrentMachineOSBuildAnnotationKey' pkg/controller/node/node_controller.go | head -20
# 2. Check if node controller does remove/clear the stale annotation
rg -nP -B5 -A5 'Delete|clear|stale' pkg/controller/node/node_controller.go | grep -A5 -B5 'Annotation'Repository: openshift/machine-config-operator
Length of output: 2834
🏁 Script executed:
# 3. Verify build reconciler's updateMachineOSConfigStatus updates annotation
rg -nP -A30 'func.*updateMachineOSConfigStatus' pkg/controller/build/reconciler.goRepository: openshift/machine-config-operator
Length of output: 1442
🏁 Script executed:
# 4. Check actual line numbers and patterns in the test
sed -n '945,970p' pkg/controller/build/osbuildcontroller_test.go
# 5. Check kubeassert.Eventually usage patterns in the test file
rg -nP 'kubeassert.Eventually|time.Sleep' pkg/controller/build/osbuildcontroller_test.go | head -20Repository: openshift/machine-config-operator
Length of output: 2376
Test does not exercise the node-controller fix it claims to cover.
This test only starts the OSBuildController via startController; the node controller is not instantiated. The stale-annotation clearing logic (line 1203 in pkg/controller/node/node_controller.go's getConfigAndBuild) never runs. Instead, the assertion at line 977 (assert.NotEqual(firstMosb.Name, currentAnnotation)) passes because the build reconciler's updateMachineOSConfigStatus sets the annotation to the new build's name when reconciling the MOSB for rendered-worker-3. If the node-controller change in this PR were reverted, this test would still pass, so it does not guard the regression.
Please either:
- Move this test to
pkg/controller/node/...and drive it throughsyncMachineConfigPoolsogetConfigAndBuildruns, or - In this test, additionally start the node controller (via setup helper or direct instantiation) so its clearing logic executes, and assert that the annotation is observed as absent between the manual stale-set and the new MOSB creation.
Additional issues while you refactor:
- Line 950 comment says "the pool has moved on to rendered-worker-3" but
rendered-worker-3is applied on line 962, after the stale annotation is set. Reflow so narrative matches code order. - Lines 926 and 965 use
time.Sleep(300ms)andtime.Sleep(500ms). The rest of this file useskubeassert.Eventually()for similar waits (see lines 496, 566, 681, 722). Prefer eventual assertions (e.g.,kubeassert.Eventually(...).MachineOSBuildExists(...)for the expected MOSB) to avoid CI flakes. - Line 955 directly indexes
apiMosc.Annotations[...]without a nil check. Usemetav1.SetMetaDataAnnotation(&apiMosc.ObjectMeta, ...)instead, which is safer if the fixture changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/build/osbuildcontroller_test.go` around lines 895 - 1003, The
test TestOSBuildControllerClearsStaleAnnotation currently only starts the
OSBuildController so the node controller's getConfigAndBuild path never runs;
either move this test into the node controller package and drive it via
syncMachineConfigPool or modify the test to start the node controller as well
(use the same setup helper that instantiates the node controller) so the
stale-annotation clearing logic in getConfigAndBuild executes; additionally,
reorder the narrative so the comment about "pool has moved on to
rendered-worker-3" matches when rendered-worker-3 is applied, replace time.Sleep
calls with kubeassert.Eventually assertions (e.g., verify
MachineOSBuildExists/length) to avoid flakiness, and stop indexing
apiMosc.Annotations directly—set the stale annotation using
metav1.SetMetaDataAnnotation(&apiMosc.ObjectMeta,
constants.CurrentMachineOSBuildAnnotationKey, firstMosb.Name) before updating to
handle nil maps safely.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/node/node_controller.go`:
- Around line 1195-1207: The code currently treats absence from the mosbLister
as deleted and clears the build annotation; instead, before deleting the
annotation in the staleAnnotation branch (symbols: staleAnnotation, ourConfig,
currentBuildName, and the Update via
ctrl.client.MachineconfigurationV1().MachineOSConfigs().Update), perform a live
API GET for the MachineOSBuild
(ctrl.client.MachineconfigurationV1().MachineOSBuilds().Get) with
currentBuildName to confirm it truly does not exist; only if the Get returns a
NotFound should you proceed to delete the annotation and call Update, otherwise
skip clearing (or requeue on transient GET errors) so transient informer
ordering won’t clear a valid build reference.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a7722d06-3794-4589-a2e3-45fe1ccb3732
📒 Files selected for processing (3)
pkg/controller/build/osbuildcontroller_test.gopkg/controller/build/reconciler.gopkg/controller/node/node_controller.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/build/osbuildcontroller_test.go
Remove verbose debugging logs added during development and keep production-appropriate logging levels. The fix logic remains unchanged.
cdcb8ad to
e5391fc
Compare
Before clearing stale annotation, verify with live API call that the build truly doesn't exist. This prevents incorrectly clearing a valid annotation when MOSC informer cache updates before MOSB informer cache. Addresses code review feedback from CodeRabbit.
|
LGTM |
|
@dkhater-redhat: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
When a MachineOSConfig's current-machine-os-build annotation points to:
The node controller would log warnings but not clear the annotation. This prevented the build controller from creating a new build for the current rendered config, causing pools to get stuck at 0 machines configured with the error 'Image is not ready yet'.
Fix: actively clear the stale annotation when detected. This allows the build controller to see the annotation is missing and trigger creation of a new MachineOSBuild for the current rendered config.
Fixes: OCPBUGS-XXXXX - MachineOSConfig 'build was not found' error
- What I did
- How to verify it
- Description for the changelog
Summary by CodeRabbit
Bug Fixes
Improvements