Skip to content

OCPBUGS-84150: MachineOSConfig "build was not found" error#5878

Open
dkhater-redhat wants to merge 5 commits intoopenshift:mainfrom
dkhater-redhat:fix-mosconfig-stale-build-annotation
Open

OCPBUGS-84150: MachineOSConfig "build was not found" error#5878
dkhater-redhat wants to merge 5 commits intoopenshift:mainfrom
dkhater-redhat:fix-mosconfig-stale-build-annotation

Conversation

@dkhater-redhat
Copy link
Copy Markdown
Contributor

@dkhater-redhat dkhater-redhat commented Apr 23, 2026

When a MachineOSConfig's current-machine-os-build annotation points to:

  1. A build for an old rendered config (annotation is stale), OR
  2. A build that no longer exists (deleted/pruned)

The node controller would log warnings but not clear the annotation. This prevented the build controller from creating a new build for the current rendered config, causing pools to get stuck at 0 machines configured with the error 'Image is not ready yet'.

Fix: actively clear the stale annotation when detected. This allows the build controller to see the annotation is missing and trigger creation of a new MachineOSBuild for the current rendered config.

Fixes: OCPBUGS-XXXXX - MachineOSConfig 'build was not found' error

- What I did

- How to verify it

- Description for the changelog

Summary by CodeRabbit

  • Bug Fixes

    • Stale OS build annotations that point to outdated or missing builds are now detected and cleared automatically, allowing pools to recover and trigger new builds instead of failing silently.
  • Improvements

    • Missing-or-stale annotation cases are treated as recoverable (driving creation of new builds) and logging has been clarified for trigger, reuse, and creation flows.

When a MachineOSConfig's current-machine-os-build annotation points to:
1. A build for an old rendered config (annotation is stale), OR
2. A build that no longer exists (deleted/pruned)

The node controller would log warnings but not clear the annotation.
This prevented the build controller from creating a new build for the
current rendered config, causing pools to get stuck at 0 machines
configured with the error 'Image is not ready yet'.

Fix: actively clear the stale annotation when detected. This allows
the build controller to see the annotation is missing and trigger
creation of a new MachineOSBuild for the current rendered config.

Fixes: OCPBUGS-XXXXX - MachineOSConfig 'build was not found' error
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 23, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@dkhater-redhat: This pull request references Jira Issue OCPBUGS-84150, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

When a MachineOSConfig's current-machine-os-build annotation points to:

  1. A build for an old rendered config (annotation is stale), OR
  2. A build that no longer exists (deleted/pruned)

The node controller would log warnings but not clear the annotation. This prevented the build controller from creating a new build for the current rendered config, causing pools to get stuck at 0 machines configured with the error 'Image is not ready yet'.

Fix: actively clear the stale annotation when detected. This allows the build controller to see the annotation is missing and trigger creation of a new MachineOSBuild for the current rendered config.

Fixes: OCPBUGS-XXXXX - MachineOSConfig 'build was not found' error

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@dkhater-redhat has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 43 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0ddd4fa-f84c-4eb8-8d03-0dd5fb5a4d2e

📥 Commits

Reviewing files that changed from the base of the PR and between e0577e9 and 4a3dffe.

📒 Files selected for processing (2)
  • pkg/controller/build/reconciler.go
  • pkg/controller/node/node_controller.go

Walkthrough

Node controller now detects when a MachineOSConfig's current-machine-os-build annotation is stale (missing or referencing a different rendered config) and clears that annotation via an API update. The build reconciler treats a missing annotation as a create path to produce a new MachineOSBuild. A test file only gained a trailing newline.

Changes

Cohort / File(s) Summary
Node controller stale-annotation handling
pkg/controller/node/node_controller.go
Adds detection of stale current-machine-os-build annotations (missing build or mismatched rendered config). Clears the annotation on the MachineOSConfig via client Update and returns an error if the update fails; adjusts logging.
Reconciler: treat missing annotation as create path
pkg/controller/build/reconciler.go
When the MachineOSConfig annotation is empty, skip reconcileImageRebuild, mark needsImageRebuild=true, and proceed to create a new MachineOSBuild instead of returning an error; adds logging around decision and missing-build cases.
Test (whitespace only)
pkg/controller/build/osbuildcontroller_test.go
Only adds a trailing newline/blank line at EOF; no behavioral or test changes.

Sequence Diagram(s)

sequenceDiagram
  participant NodeController
  participant API_Server
  participant Reconciler
  participant BuildStore
  NodeController->>API_Server: Read MachineOSConfig (has current-machine-os-build)
  API_Server-->>NodeController: Return resource
  NodeController->>BuildStore: Lookup referenced MachineOSBuild
  alt Build missing or mismatched
    NodeController->>API_Server: Update MachineOSConfig (delete annotation)
    API_Server-->>NodeController: Update result
    NodeController->>Reconciler: Return error / signal missing annotation
    Reconciler->>API_Server: Read MachineOSConfig (annotation empty)
    API_Server-->>Reconciler: Return resource
    Reconciler->>Reconciler: Set needsImageRebuild = true
    Reconciler->>BuildStore: Create new MachineOSBuild
    BuildStore-->>Reconciler: Build created
  else Build exists and matches
    NodeController-->>Reconciler: Continue with existing build
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ❓ Inconclusive PR changes focus on implementation code for stale annotation handling; test file modification is cosmetic without new Ginkgo test implementations to assess. Consider adding Ginkgo test cases validating stale annotation detection, annotation-missing handling, and successful annotation removal with proper setup, cleanup, timeouts, and assertion messages.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main objective: clearing stale MachineOSConfig build annotations to resolve the 'build was not found' error.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies only non-test files (node_controller.go, reconciler.go) and adds trailing newline to osbuildcontroller_test.go. No test names were added or modified.
Microshift Test Compatibility ✅ Passed PR does not add any new Ginkgo e2e tests, only modifies non-test controller logic and adds a trailing newline.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This pull request does not add any new Ginkgo e2e tests. Changes consist of controller implementation updates and a trailing newline in a unit test file.
Topology-Aware Scheduling Compatibility ✅ Passed Pull request modifies only configuration and build tracking controllers with no scheduling constraints, affinity rules, topology constraints, or deployment manifests introduced.
Ote Binary Stdout Contract ✅ Passed Modified files are controller libraries with no main/init/suite functions; logging uses klog without direct stdout writes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. Changes are limited to formatting updates and production code logic modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dkhater-redhat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2026
Test validates that when a MachineOSConfig has a stale
current-machine-os-build annotation (pointing to a build for an old
rendered config or a non-existent build), the annotation is properly
cleared to allow the build controller to create a new build.

Test covers four scenarios:
1. Stale annotation - build exists but for old rendered config
2. Stale annotation - build does not exist
3. Valid annotation - build exists for current rendered config
4. No annotation - fallback to matching build
@dkhater-redhat dkhater-redhat force-pushed the fix-mosconfig-stale-build-annotation branch from 2109ca1 to 7b612b4 Compare April 24, 2026 00:06
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controller/node/node_controller.go (1)

1178-1208: ⚠️ Potential issue | 🟠 Major

Getter now mutates cluster state without retry-on-conflict or fresh API fetch.

Several concerns about this block:

  1. Unretried write on a cached object. ourConfig originates from ctrl.moscLister, so configCopy carries a cache-local ResourceVersion. A concurrent update by the build reconciler (e.g., at pkg/controller/build/reconciler.go:339) will trigger a 409 Conflict. Unlike Node mutations in this same file (updateCandidateNode, setUpdateInProgressTaint — lines 1478, 1799, 1831), which all use clientretry.RetryOnConflict, this MOSC write fails immediately. Even the build reconciler re-fetches via lister before its MOSC Update. Re-Get from the API and wrap with RetryOnConflict to handle concurrent writers.

  2. Read path now mutates. getConfigAndBuild is called in a hot read path (getConfigAndBuildAndLayeredStatussyncMachineConfigPool); any Update failure prevents the caller from seeing the mosc/mosb. Consider returning an additional staleAnnotation signal and letting syncMachineConfigPool (or a dedicated helper) perform the clear, so lookup failures and clear failures are clearly distinguishable.

  3. Convoluted flag. staleAnnotation is set in two branches and then re-checked via !staleAnnotation to detect "build name not found at all." A foundBuild local and a single post-check is clearer.

  4. context.TODO() with no timeout. A deadline-bounded context derived from the sync would prevent a slow API server from stalling the worker.

  5. Returned ourConfig still has the old annotation in memory. After a successful Update you return the unmodified ourConfig, not configCopy. Harmless for current callers (they don't re-persist it), but easy to miss if a future caller reads the annotation again on the same pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/node/node_controller.go` around lines 1178 - 1208,
getConfigAndBuild currently clears a stale annotation by updating a cache-backed
ourConfig (from ctrl.moscLister) with context.TODO() and no retry, which can
cause 409s, blocks, and returns the old object; change the logic so you detect a
stale annotation but do not perform an immediate write on the lister object:
instead re-get a fresh MachineOSConfig from the API, perform the delete on that
copy, and update using client.MachineconfigurationV1().MachineOSConfigs().Update
inside clientretry.RetryOnConflict with a context derived from the caller (add a
timeout/deadline), and after a successful update return the updated config (or
signal the staleAnnotation to syncMachineConfigPool to perform the clear); refer
to getConfigAndBuild, ourConfig, configCopy,
ctrl.client.MachineconfigurationV1().MachineOSConfigs().Update, ctrl.moscLister
and syncMachineConfigPool when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/build/osbuildcontroller_test.go`:
- Around line 895-1003: The test TestOSBuildControllerClearsStaleAnnotation
currently only starts the OSBuildController so the node controller's
getConfigAndBuild path never runs; either move this test into the node
controller package and drive it via syncMachineConfigPool or modify the test to
start the node controller as well (use the same setup helper that instantiates
the node controller) so the stale-annotation clearing logic in getConfigAndBuild
executes; additionally, reorder the narrative so the comment about "pool has
moved on to rendered-worker-3" matches when rendered-worker-3 is applied,
replace time.Sleep calls with kubeassert.Eventually assertions (e.g., verify
MachineOSBuildExists/length) to avoid flakiness, and stop indexing
apiMosc.Annotations directly—set the stale annotation using
metav1.SetMetaDataAnnotation(&apiMosc.ObjectMeta,
constants.CurrentMachineOSBuildAnnotationKey, firstMosb.Name) before updating to
handle nil maps safely.

---

Outside diff comments:
In `@pkg/controller/node/node_controller.go`:
- Around line 1178-1208: getConfigAndBuild currently clears a stale annotation
by updating a cache-backed ourConfig (from ctrl.moscLister) with context.TODO()
and no retry, which can cause 409s, blocks, and returns the old object; change
the logic so you detect a stale annotation but do not perform an immediate write
on the lister object: instead re-get a fresh MachineOSConfig from the API,
perform the delete on that copy, and update using
client.MachineconfigurationV1().MachineOSConfigs().Update inside
clientretry.RetryOnConflict with a context derived from the caller (add a
timeout/deadline), and after a successful update return the updated config (or
signal the staleAnnotation to syncMachineConfigPool to perform the clear); refer
to getConfigAndBuild, ourConfig, configCopy,
ctrl.client.MachineconfigurationV1().MachineOSConfigs().Update, ctrl.moscLister
and syncMachineConfigPool when making these changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9f74fdc9-4f68-4809-8964-c584ef0575dc

📥 Commits

Reviewing files that changed from the base of the PR and between 5f1b7ad and 7b612b4.

📒 Files selected for processing (2)
  • pkg/controller/build/osbuildcontroller_test.go
  • pkg/controller/node/node_controller.go

Comment on lines +895 to +1003

// TestOSBuildControllerClearsStaleAnnotation validates that when a MachineOSConfig
// has a stale current-machine-os-build annotation, the node controller clears it
// and allows a new build to be created. This tests the fix for the bug where pools
// get stuck at 0 machines configured with "Image is not ready yet" errors.
//
// Test scenario:
// 1. Create initial successful build for rendered-worker-1
// 2. Manually create a stale annotation pointing to the old build
// 3. Apply a layer-only MachineConfig change (e.g., SSH key) -> rendered-worker-2
// 4. Verify stale annotation is cleared
// 5. Verify new build is created for rendered-worker-2
func TestOSBuildControllerClearsStaleAnnotation(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
t.Cleanup(cancel)
poolName := "worker"

// Setup: Create initial successful build
_, mcfgclient, _, _, mosc, firstMosb, mcp, _, kubeassert := setupOSBuildControllerForTestWithSuccessfulBuild(ctx, t, poolName)

// Verify first build exists and gets annotation
isMachineOSBuildReachedExpectedCount(ctx, t, mcfgclient, mosc, 1)
assertMachineOSConfigGetsCurrentBuildAnnotation(ctx, t, mcfgclient, mosc, firstMosb)

t.Logf("Initial build %s created successfully for %s", firstMosb.Name, mosc.Name)

// Now apply a layer-only change (SSH key addition, doesn't need image rebuild)
// This will create a new rendered config: rendered-worker-2
insertNewRenderedMachineConfigWithoutImageChangeAndUpdatePool(ctx, t, mcfgclient, mcp.Name, "rendered-worker-2")

// Give the controller time to process
time.Sleep(300 * time.Millisecond)

// At this point, a new MOSB should be created for the layer-only change
// The annotation should point to the new build, not the old one
mosbList, err := mcfgclient.MachineconfigurationV1().
MachineOSBuilds().
List(ctx, metav1.ListOptions{LabelSelector: utils.MachineOSBuildForPoolSelector(mosc).String()})
require.NoError(t, err)
require.Len(t, mosbList.Items, 2, "expected a new MOSB to be created for layering-only change")

// Find the new build (not the first one)
var secondMosb *mcfgv1.MachineOSBuild
for i := range mosbList.Items {
if mosbList.Items[i].Name != firstMosb.Name {
secondMosb = &mosbList.Items[i]
break
}
}
require.NotNil(t, secondMosb, "second MOSB should exist")
assert.Equal(t, "rendered-worker-2", secondMosb.Spec.MachineConfig.Name, "second MOSB should be for rendered-worker-2")

t.Logf("Second build %s created for new rendered config %s", secondMosb.Name, secondMosb.Spec.MachineConfig.Name)

// Now simulate the bug: Manually set a stale annotation pointing to the old build
// while the pool has moved on to rendered-worker-3
apiMosc, err := mcfgclient.MachineconfigurationV1().MachineOSConfigs().Get(ctx, mosc.Name, metav1.GetOptions{})
require.NoError(t, err)

// Set stale annotation pointing to first build
apiMosc.Annotations[constants.CurrentMachineOSBuildAnnotationKey] = firstMosb.Name
apiMosc, err = mcfgclient.MachineconfigurationV1().MachineOSConfigs().Update(ctx, apiMosc, metav1.UpdateOptions{})
require.NoError(t, err)
t.Logf("Manually set stale annotation to %s", firstMosb.Name)

// Apply another layer-only change -> rendered-worker-3
// This should trigger the stale annotation detection and cleanup
insertNewRenderedMachineConfigWithoutImageChangeAndUpdatePool(ctx, t, mcfgclient, mcp.Name, "rendered-worker-3")

// Give the controller time to detect stale annotation and create new build
time.Sleep(500 * time.Millisecond)

// Verify that:
// 1. The stale annotation was cleared or updated to point to a new build
// 2. A new MOSB was created for rendered-worker-3
apiMosc, err = mcfgclient.MachineconfigurationV1().MachineOSConfigs().Get(ctx, mosc.Name, metav1.GetOptions{})
require.NoError(t, err)

currentAnnotation := apiMosc.Annotations[constants.CurrentMachineOSBuildAnnotationKey]
assert.NotEqual(t, firstMosb.Name, currentAnnotation,
"stale annotation pointing to first build should be cleared/updated")

mosbList, err = mcfgclient.MachineconfigurationV1().
MachineOSBuilds().
List(ctx, metav1.ListOptions{LabelSelector: utils.MachineOSBuildForPoolSelector(mosc).String()})
require.NoError(t, err)
require.Len(t, mosbList.Items, 3, "expected third MOSB to be created after stale annotation cleanup")

// Find the third build for rendered-worker-3
var thirdMosb *mcfgv1.MachineOSBuild
for i := range mosbList.Items {
if mosbList.Items[i].Spec.MachineConfig.Name == "rendered-worker-3" {
thirdMosb = &mosbList.Items[i]
break
}
}
require.NotNil(t, thirdMosb, "third MOSB should exist for rendered-worker-3")
t.Logf("Third build %s created successfully after stale annotation cleanup", thirdMosb.Name)

// Verify job was NOT created for layer-only builds
for _, mosb := range mosbList.Items {
if mosb.Name != firstMosb.Name {
jobName := utils.GetBuildJobName(&mosb)
kubeassert.JobDoesNotExist(jobName, "layering-only MOSB %s should not spawn a build Job", mosb.Name)
}
}

t.Logf("Test passed: stale annotation was cleaned up and new builds were created correctly")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm that startController only wires the OSBuildController and does not bring up the node controller
# (which would mean the new node-controller clearing logic runs in this test).
rg -nP -C3 'func startController\(' pkg/controller/build/osbuildcontroller_test.go
rg -nP -C3 'newOSBuildController\(' pkg/controller/build/
# And confirm no node.Controller is instantiated from the build test helpers.
rg -nP -C2 '"github.com/openshift/machine-config-operator/pkg/controller/node"' pkg/controller/build/

Repository: openshift/machine-config-operator

Length of output: 2302


🏁 Script executed:

# 1. Verify stale-annotation clearing logic exists in node controller
rg -nP 'getConfigAndBuild|CurrentMachineOSBuildAnnotationKey' pkg/controller/node/node_controller.go | head -20

# 2. Check if node controller does remove/clear the stale annotation
rg -nP -B5 -A5 'Delete|clear|stale' pkg/controller/node/node_controller.go | grep -A5 -B5 'Annotation'

Repository: openshift/machine-config-operator

Length of output: 2834


🏁 Script executed:

# 3. Verify build reconciler's updateMachineOSConfigStatus updates annotation
rg -nP -A30 'func.*updateMachineOSConfigStatus' pkg/controller/build/reconciler.go

Repository: openshift/machine-config-operator

Length of output: 1442


🏁 Script executed:

# 4. Check actual line numbers and patterns in the test
sed -n '945,970p' pkg/controller/build/osbuildcontroller_test.go

# 5. Check kubeassert.Eventually usage patterns in the test file
rg -nP 'kubeassert.Eventually|time.Sleep' pkg/controller/build/osbuildcontroller_test.go | head -20

Repository: openshift/machine-config-operator

Length of output: 2376


Test does not exercise the node-controller fix it claims to cover.

This test only starts the OSBuildController via startController; the node controller is not instantiated. The stale-annotation clearing logic (line 1203 in pkg/controller/node/node_controller.go's getConfigAndBuild) never runs. Instead, the assertion at line 977 (assert.NotEqual(firstMosb.Name, currentAnnotation)) passes because the build reconciler's updateMachineOSConfigStatus sets the annotation to the new build's name when reconciling the MOSB for rendered-worker-3. If the node-controller change in this PR were reverted, this test would still pass, so it does not guard the regression.

Please either:

  • Move this test to pkg/controller/node/... and drive it through syncMachineConfigPool so getConfigAndBuild runs, or
  • In this test, additionally start the node controller (via setup helper or direct instantiation) so its clearing logic executes, and assert that the annotation is observed as absent between the manual stale-set and the new MOSB creation.

Additional issues while you refactor:

  • Line 950 comment says "the pool has moved on to rendered-worker-3" but rendered-worker-3 is applied on line 962, after the stale annotation is set. Reflow so narrative matches code order.
  • Lines 926 and 965 use time.Sleep(300ms) and time.Sleep(500ms). The rest of this file uses kubeassert.Eventually() for similar waits (see lines 496, 566, 681, 722). Prefer eventual assertions (e.g., kubeassert.Eventually(...).MachineOSBuildExists(...) for the expected MOSB) to avoid CI flakes.
  • Line 955 directly indexes apiMosc.Annotations[...] without a nil check. Use metav1.SetMetaDataAnnotation(&apiMosc.ObjectMeta, ...) instead, which is safer if the fixture changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/build/osbuildcontroller_test.go` around lines 895 - 1003, The
test TestOSBuildControllerClearsStaleAnnotation currently only starts the
OSBuildController so the node controller's getConfigAndBuild path never runs;
either move this test into the node controller package and drive it via
syncMachineConfigPool or modify the test to start the node controller as well
(use the same setup helper that instantiates the node controller) so the
stale-annotation clearing logic in getConfigAndBuild executes; additionally,
reorder the narrative so the comment about "pool has moved on to
rendered-worker-3" matches when rendered-worker-3 is applied, replace time.Sleep
calls with kubeassert.Eventually assertions (e.g., verify
MachineOSBuildExists/length) to avoid flakiness, and stop indexing
apiMosc.Annotations directly—set the stale annotation using
metav1.SetMetaDataAnnotation(&apiMosc.ObjectMeta,
constants.CurrentMachineOSBuildAnnotationKey, firstMosb.Name) before updating to
handle nil maps safely.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/node/node_controller.go`:
- Around line 1195-1207: The code currently treats absence from the mosbLister
as deleted and clears the build annotation; instead, before deleting the
annotation in the staleAnnotation branch (symbols: staleAnnotation, ourConfig,
currentBuildName, and the Update via
ctrl.client.MachineconfigurationV1().MachineOSConfigs().Update), perform a live
API GET for the MachineOSBuild
(ctrl.client.MachineconfigurationV1().MachineOSBuilds().Get) with
currentBuildName to confirm it truly does not exist; only if the Get returns a
NotFound should you proceed to delete the annotation and call Update, otherwise
skip clearing (or requeue on transient GET errors) so transient informer
ordering won’t clear a valid build reference.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a7722d06-3794-4589-a2e3-45fe1ccb3732

📥 Commits

Reviewing files that changed from the base of the PR and between 7b612b4 and e0577e9.

📒 Files selected for processing (3)
  • pkg/controller/build/osbuildcontroller_test.go
  • pkg/controller/build/reconciler.go
  • pkg/controller/node/node_controller.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/controller/build/osbuildcontroller_test.go

Comment thread pkg/controller/node/node_controller.go
Remove verbose debugging logs added during development and keep
production-appropriate logging levels. The fix logic remains unchanged.
@dkhater-redhat dkhater-redhat force-pushed the fix-mosconfig-stale-build-annotation branch from cdcb8ad to e5391fc Compare April 28, 2026 17:31
Before clearing stale annotation, verify with live API call that the
build truly doesn't exist. This prevents incorrectly clearing a valid
annotation when MOSC informer cache updates before MOSB informer cache.

Addresses code review feedback from CodeRabbit.
@umohnani8
Copy link
Copy Markdown
Contributor

LGTM

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 28, 2026

@dkhater-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-images 4a3dffe link true /test okd-scos-images
ci/prow/unit 4a3dffe link true /test unit
ci/prow/e2e-gcp-op-ocl-part2 4a3dffe link true /test e2e-gcp-op-ocl-part2
ci/prow/images 4a3dffe link true /test images
ci/prow/e2e-gcp-op-ocl-part1 4a3dffe link true /test e2e-gcp-op-ocl-part1

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants