OCPBUGS-83826: deploy-from-self when skopeo < 1.22.2#5867
OCPBUGS-83826: deploy-from-self when skopeo < 1.22.2#5867QiWang19 wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughAdds cached host Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Daemon
participant Host as Host (skopeo/rpm-ostree)
participant Registry as Image Registry
participant Updater as In-place Updater
Daemon->>Host: Check NodeUpdaterClient.IsNewEnoughForLayering()
alt host not new enough
Daemon->>Updater: Force privileged in-place update
Updater-->>Daemon: Start update and schedule reboot
else host new enough
Daemon->>Host: skopeoSupportsMultiArchSigstore(imageURL)?
alt skopeo >= 1.22.2
Host-->>Daemon: supported
Daemon->>Updater: Proceed with layering in-place update
else skopeo older
Daemon->>Registry: skopeo inspect --raw docker://<image> (timeout)
Registry-->>Daemon: manifest JSON (mediaType)
alt mediaType is manifest-list or oci/index
Daemon-->>Host: treat as multi-arch (support)
Daemon->>Updater: Proceed with layering in-place update
else
Daemon->>Updater: Force privileged in-place update
Updater-->>Daemon: Start update and schedule reboot
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@QiWang19: This pull request references Jira Issue OCPBUGS-83826, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test all |
|
@QiWang19: This pull request references Jira Issue OCPBUGS-83826, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/daemon/update.go`:
- Around line 2972-2983: The code currently strips everything after the first
'-' from versionStr which converts prerelease versions like "1.22.2-rc1" into
the stable "1.22.2"; stop removing the '-' portion so semver.NewVersion parses
full prerelease identifiers (remove the dashIdx trimming), then use
semver.NewVersion(versionStr) and compare against
minSkopeoVersionForMultiArchSigstore via
skopeoVersion.Compare(*semver.New(minSkopeoVersionForMultiArchSigstore)) to set
skopeoMultiArchSigstoreSupportedValue; if parsing fails for distro
package-release suffixes (e.g., "1.22.2-3.el9"), attempt a conservative
secondary parse by stripping only a numeric package-release pattern (e.g.,
"-<digits>(..*)?$") before calling semver.NewVersion so you still reject true
prerelease identifiers while tolerating distro release metadata referenced by
updateLayeredOS and RunFirstbootCompleteMachineconfig.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ea8b93d1-1bbf-4dda-8529-764ff5ad88b2
📒 Files selected for processing (2)
pkg/daemon/daemon.gopkg/daemon/update.go
485b74b to
e9c2dce
Compare
|
/test all |
e9c2dce to
b5cf5df
Compare
|
/test all |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/daemon/update.go (1)
2973-2985:⚠️ Potential issue | 🟠 MajorDon't normalize prerelease skopeo builds to stable.
Line 2975 strips everything after the first
-, so1.22.2-rc1/1.22.2-devbecome1.22.2and incorrectly report support. That can skip the container fallback here and inpkg/daemon/daemon.go:1241, reintroducing the multi-arch Sigstore failure on prerelease builds. Please distinguish true prereleases from distro package-release suffixes instead of truncating all suffixes.Suggested fix
- // strip off any potential suffix, we only need to compare the major version - versionStr := fields[2] - if dashIdx := strings.Index(versionStr, "-"); dashIdx != -1 { - versionStr = versionStr[:dashIdx] - } - skopeoVersion, err := semver.NewVersion(versionStr) + versionStr := fields[2] + baseVersion := versionStr + prereleaseLike := false + if dashIdx := strings.Index(versionStr, "-"); dashIdx != -1 { + baseVersion = versionStr[:dashIdx] + suffix := strings.ToLower(versionStr[dashIdx+1:]) + prereleaseLike = strings.HasPrefix(suffix, "dev") || + strings.HasPrefix(suffix, "rc") || + strings.HasPrefix(suffix, "alpha") || + strings.HasPrefix(suffix, "beta") + } + skopeoVersion, err := semver.NewVersion(baseVersion) if err != nil { - klog.Errorf("failed to parse skopeo version %s: %v", versionStr, err) + klog.Errorf("failed to parse skopeo version %s: %v", baseVersion, err) skopeoMultiArchSigstoreSupportedValue = false return } minSkopeoVersionForMultiArchSigstore := "1.22.2" - skopeoMultiArchSigstoreSupportedValue = skopeoVersion.Compare(*semver.New(minSkopeoVersionForMultiArchSigstore)) >= 0 + cmp := skopeoVersion.Compare(*semver.New(minSkopeoVersionForMultiArchSigstore)) + skopeoMultiArchSigstoreSupportedValue = cmp > 0 || (cmp == 0 && !prereleaseLike)For github.com/coreos/go-semver/semver, how are `1.22.2-rc1`, `1.22.2-dev`, and `1.22.2-3.el9` compared against `1.22.2`? Specifically, should prerelease builds be treated as older than `1.22.2` while distro package-release suffixes are ignored for a minimum-version gate?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/daemon/update.go` around lines 2973 - 2985, The code currently strips any suffix after the first '-' which turns prerelease builds like 1.22.2-rc1 into 1.22.2; instead, change the logic in pkg/daemon/update.go around versionStr/skopeoVersion so that you only remove a trailing package-release suffix (e.g. a pure numeric package iteration like "-3" or common distro patterns such as "-3.el9" / "-1.fc34") but do not remove true semver prerelease identifiers (like "-rc1" or "-dev"); implement this by examining the substring after the first '-' and only truncating when it matches a package-release regex (digits with optional dot/alpha segments typical of distro releases), otherwise keep the suffix and pass the full string to semver.NewVersion so prereleases are parsed and compared properly against minSkopeoVersionForMultiArchSigstore when setting skopeoMultiArchSigstoreSupportedValue.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/daemon/update.go`:
- Around line 2973-2985: The code currently strips any suffix after the first
'-' which turns prerelease builds like 1.22.2-rc1 into 1.22.2; instead, change
the logic in pkg/daemon/update.go around versionStr/skopeoVersion so that you
only remove a trailing package-release suffix (e.g. a pure numeric package
iteration like "-3" or common distro patterns such as "-3.el9" / "-1.fc34") but
do not remove true semver prerelease identifiers (like "-rc1" or "-dev");
implement this by examining the substring after the first '-' and only
truncating when it matches a package-release regex (digits with optional
dot/alpha segments typical of distro releases), otherwise keep the suffix and
pass the full string to semver.NewVersion so prereleases are parsed and compared
properly against minSkopeoVersionForMultiArchSigstore when setting
skopeoMultiArchSigstoreSupportedValue.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: c107fb66-a7aa-4a02-9e1c-6cb595b07e58
📒 Files selected for processing (2)
pkg/daemon/daemon.gopkg/daemon/update.go
✅ Files skipped from review due to trivial changes (1)
- pkg/daemon/daemon.go
|
@yuqi-zhang could you review? |
|
/test e2e-aws-ovn |
|
/verified later @QiWang19 The failure case of multi-arch will be verified post-merge. verified the older skopeo verison falls to the in-place update case on a non multi-arch cluster. |
|
@QiWang19: This PR has been marked to be verified later by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aws-ovn |
|
e2e-aws-ovn seems unrelated to this pull: I've reported this internally, but am not yet sure what's going on there. |
|
Still working on the /test images |
yuqi-zhang
left a comment
There was a problem hiding this comment.
Main concerns inline about the amount of affected clusters that would need double reboots
Not sure about the ovnWaiting for status to be reported — Waiting for pipeline condition to trigger this job:
That's a general requirement across openshift to move to pipeline stages. We were just an early adopter. Unit tests will always run, but e2es would only run if lgtm has been applied. Of course you can still trigger them normally as you did @wking
Still working on the No package matches 'nmstate >= 2.2.10' issue. Check with fresh runs:
I think that should be good now via openshift/release#78290
| return | ||
| } | ||
|
|
||
| // strip off any potential suffix, we only need to compare the major version |
There was a problem hiding this comment.
I think this strips it to e.g. 1.21.0? which isn't just the major version (unless I'm mis-interpreting something here in the logic)
There was a problem hiding this comment.
I think we can drop the strip off suffix code as the skopoe official repo does not release version with dash suffix
| logSystem("rpm-ostree is not new enough for layering; forcing an update via container") | ||
| // If Skopeo used by rpm-ostree is an older version with issue supporting multi-arch Sigstore verification, run as a privileged container which has updated skopeo. | ||
| // See https://redhat.atlassian.net/browse/OCPBUGS-83826 and https://redhat.atlassian.net/browse/OCPBUGS-81187 | ||
| if !newEnough || !skopeoSupportsMultiArchSigstore() { |
There was a problem hiding this comment.
I'm wondering if this is actually needed here. updateLayeredOS should just be doing OS updates in-cluster when we do an upgrade, which generally should have a new enough rpm-ostree (actually, in which release did we introduce the updated skopeo version? It looks like it's in 4.22 only? Should we try to backport that so older releases upgrading to this would already have the right rpm-ostree/skopeo?).
Basically, I think we should try to minimize the times we do a deploy-from-self operation to only when absolutely necessary. ( also see comment in the bootstrap case)
Side note, I think the comment above This currently will incur a double reboot is actually incorrect for the in-cluster case because we don't actually explicitly reboot here, but just once as part of the normal update flow. So maybe it's fine to leave here now I think about it.
There was a problem hiding this comment.
Should we try to backport that so older releases upgrading to this would already have the right rpm-ostree/skopeo
skopeo fix is included in 1.22.2, OCP 4.21 is with skopeo 1.22.0. I think we can backport the fix to skopeo and cut an upgrade edge for the OCP version that picks up the backport. We can clean up this workaround after the upgrade edge is added. I think we need this workaround given the 4.22 release timeline?
There was a problem hiding this comment.
rerun the above reproducer upgrade from 4.21.12(skopeo 1.22.0) multi-arch -> 4.22.0-rc.0 (skopeo 1.22.2) multi-arch, cluster upgrade succefully. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/release-openshift-origin-installer-launch-gcp-modern/2049222498062438400
| logSystem("rpm-ostree is not new enough for new-format image; forcing an update via container and queuing immediate reboot") | ||
| // If Skopeo used by rpm-ostree is an older version with issue supporting multi-arch Sigstore verification, run as a privileged container which has updated skopeo. | ||
| // See https://redhat.atlassian.net/browse/OCPBUGS-83826 and https://redhat.atlassian.net/browse/OCPBUGS-81187 | ||
| if !newEnough || !skopeoSupportsMultiArchSigstore() { |
There was a problem hiding this comment.
I'm wondering if we can only target cases where this would be broken, i.e. multi-arch only? Unlike the in-cluster case this actually does incur a double reboot which increases node join time, which is not great in the general case if it can be avoided. If only 4.22 has the right version this means that all clusters not having bootimage updates on by default would have to double reboot to join nodes, which seems excessive.
|
/test images |
|
/test e2e-aws-ovn |
b5cf5df to
34856ed
Compare
|
@QiWang19: This pull request references Jira Issue OCPBUGS-83826, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aws-ovn |
|
It's stale with the recent push, but the previous commit's e2e-aws-ovn run had gather-extra node logs like: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/5867/pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn/2047508480398462976/artifacts/e2e-aws-ovn/gather-extra/artifacts/nodes/ip-10-0-0-77.us-west-2.compute.internal/journal | zgrep -1 skopeo
Apr 24 03:41:54.106135 ip-10-0-0-77 silly_poincare[1914]: I0424 03:41:54.106103 1916 command_runner.go:24] Running captured: rpm-ostree --version
Apr 24 03:41:55.788183 ip-10-0-0-77 silly_poincare[1914]: I0424 03:41:55.788109 1916 update.go:2965] skopeo version output: skopeo version 1.22.0 commit: 1a41cccfa63d5bd923b57e85986041d7283e720c
Apr 24 03:41:55.788197 ip-10-0-0-77 silly_poincare[1914]: I0424 03:41:55.788140 1916 update.go:2994] "rpm-ostree or skopeo is not new enough for new-format image; forcing an update via container and queuing immediate reboot"
Apr 24 03:41:55.788836 ip-10-0-0-77 podman[1902]: I0424 03:41:55.788109 1916 update.go:2965] skopeo version output: skopeo version 1.22.0 commit: 1a41cccfa63d5bd923b57e85986041d7283e720c
Apr 24 03:41:55.788836 ip-10-0-0-77 podman[1902]: I0424 03:41:55.788140 1916 update.go:2994] "rpm-ostree or skopeo is not new enough for new-format image; forcing an update via container and queuing immediate reboot"
Apr 24 03:41:55.827449 ip-10-0-0-77 root[2019]: machine-config-daemon[1916]: "rpm-ostree or skopeo is not new enough for new-format image; forcing an update via container and queuing immediate reboot"
Apr 24 03:41:55.827816 ip-10-0-0-77 silly_poincare[1914]: I0424 03:41:55.827776 1916 update.go:2907] Running: setenforce 0
--
Apr 24 03:43:29.462719 ip-10-0-0-77 podman[3168]: runc 4:1.4.0-2.el9 -> 4:1.4.2-1.el9_8
Apr 24 03:43:29.462719 ip-10-0-0-77 podman[3168]: skopeo 2:1.22.0-2.el9 -> 2:1.22.2-1.el9_8
Apr 24 03:43:29.462719 ip-10-0-0-77 podman[3168]: sssd 2.9.8-1.el9 -> 2.9.8-4.el9_8
--
Apr 24 03:43:29.462280 ip-10-0-0-77 vigorous_kilby[3189]: runc 4:1.4.0-2.el9 -> 4:1.4.2-1.el9_8
Apr 24 03:43:29.462286 ip-10-0-0-77 vigorous_kilby[3189]: skopeo 2:1.22.0-2.el9 -> 2:1.22.2-1.el9_8
Apr 24 03:43:29.462292 ip-10-0-0-77 vigorous_kilby[3189]: sssd 2.9.8-1.el9 -> 2.9.8-4.el9_8
--
Apr 24 03:44:14.918722 ip-10-0-0-77 podman[1239]: I0424 03:44:14.918628 1253 command_runner.go:24] Running captured: rpm-ostree --version
Apr 24 03:44:15.078762 ip-10-0-0-77 great_dubinsky[1251]: I0424 03:44:15.078681 1253 update.go:2965] skopeo version output: skopeo version 1.22.2 commit: 4db1c56e8003ad07759a4d899254643d445dbe04
Apr 24 03:44:15.078776 ip-10-0-0-77 great_dubinsky[1251]: I0424 03:44:15.078712 1253 daemon.go:1259] rpm-ostree has container feature
Apr 24 03:44:15.079021 ip-10-0-0-77 podman[1239]: I0424 03:44:15.078681 1253 update.go:2965] skopeo version output: skopeo version 1.22.2 commit: 4db1c56e8003ad07759a4d899254643d445dbe04
Apr 24 03:44:15.079021 ip-10-0-0-77 podman[1239]: I0424 03:44:15.078712 1253 daemon.go:1259] rpm-ostree has container featurewhich looks good to me (othere than the |
uses deploy-from-self when skopeo is actually older than 1.22.2 that has issue with multi-arch sigstore verification Signed-off-by: Qi Wang <qiwan@redhat.com>
34856ed to
30e3e49
Compare
multiarch check failed, which seems to be caused by no authfile passed to skopeo, rebased. /test e2e-aws-ovn |
|
/verified later @QiWang19 |
|
@QiWang19: This PR has been marked to be verified later by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@yuqi-zhang PTAL |
|
@wking PTAL |
yuqi-zhang
left a comment
There was a problem hiding this comment.
Some more questions - the general idea looks better, thanks for making it conditional
| args = append(args, "--authfile", kubeletAuthFile) | ||
| } | ||
| args = append(args, "docker://"+imageURL) | ||
| cmd := exec.CommandContext(ctx, "skopeo", args...) |
There was a problem hiding this comment.
Want to check: this is using the current skopeo on-disk, which is of whatever bootimage version is in the bootimage (let's say for example, whatever we shipped in 4.13, if the cluster is not in a managed bootimage path).
Do you expect that inspect to fail at all due to similar problems that caused the initial failure, or any other changes in the meantime that would be incompatible with an old skopeo?
There was a problem hiding this comment.
skopeo inspect does not validate the signature the way skopeo copy does, so old skopeo inspect here should not cause the initial failure.
| // This currently will incur a double reboot; see https://github.com/coreos/rpm-ostree/issues/4018 | ||
| if !newEnough { | ||
| logSystem("rpm-ostree is not new enough for new-format image; forcing an update via container and queuing immediate reboot") | ||
| // If skopeo is < 1.22.2 on a multi-arch image, run as a privileged container which has updated skopeo. |
There was a problem hiding this comment.
I forgot if I asked this - does the rpm-ostree directly use the skopeo binary on-disk or does it use some vendored library, etc. ?
There was a problem hiding this comment.
https://redhat.atlassian.net/browse/OCPBUGS-81187?focusedCommentId=16667251 and https://redhat-internal.slack.com/archives/CB95J6R4N/p1775744756600979
The previous diagnosis pointed out the skopeo experimental-image-proxy subcommand used by rpm-ostree, which indicates rpm-ostree uses the skopeo on the host, but I don't know much about the code related to it.
| if err := json.Unmarshal(out, &manifest); err != nil { | ||
| return false, fmt.Errorf("failed to parse manifest for %s: %w", imageURL, err) | ||
| } | ||
| multiArch := manifest.MediaType == "application/vnd.docker.distribution.manifest.list.v2+json" || |
There was a problem hiding this comment.
Can you help me understand where these are defined and why these are guaranteed to be multi-arch?
There was a problem hiding this comment.
manifest.list in the mediaType means this is a manifest list. skopeo inspect --raw output the manifest, if it is multi arch images, skopeo returns a manifest list containing entries for each architecture.
mediaType of single arch image would be application/vnd.oci.image.manifest.v1+json. https://github.com/distribution/distribution/blob/v2.8.3/docs/spec/manifest-v2-2.md#manifest-list
|
|
||
| multiArch, err := isMultiArchImage(imageURL) | ||
| if err != nil { | ||
| klog.Warningf("Failed to determine if image %s is multi-arch: %v", imageURL, err) |
There was a problem hiding this comment.
Hmm, this would mean that if something went wrong we just blanket update skopeo and hope for the best (since this is early boot only, any logging doesn't actually help I think).
What real failure scenarios would we hit? And would a skopeo update + reboot help or make it worse?
There was a problem hiding this comment.
the logic is:
If skopeo version check errors, only do inplace deploy if image is multi-arch
If multi-arch check errors, only do inplace deploy if skopeo version < 1.22.2
If both error, fallback to do inplace deploy.
skopeo version check failure should be rare, possiblely caused by the skopeo on the host is broken, so skopeo update would help since it uses the skopeo from the container.
mutli arch check may fail on network issue or registry authication, in this case I think in place deploy may not help, it would also fail on the same network/authentication error. But falling back to in place deploy make sure we do not fail on the original sigstore verification failure when we can not determine the skopeo version or if image is multi-arch.
yuqi-zhang
left a comment
There was a problem hiding this comment.
Thanks for the clarifications - lgtm, if we can also validate through CI that's be good.
/lgtm
/hold
To let the tests run, but also give @wking a chance for any final feedback
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: QiWang19, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
|
@QiWang19: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |


use deploy-from-self when skopeo is older than 1.22.2 and the boot image is multi-arch, which causes skopeo sigstore to fail.
- What I did
- How to verify it
Can manually verify on a non multi-arch release the skopeo version check runs and triggers the in-place update with the order version of skopeo
4.22.0-0.nightly-2026-04-23-021021, #5867 cluster:
- Description for the changelog
Summary by CodeRabbit