fix: improve must-gather resilience for SOS and container log collection #142

Merged: openshift-merge-bot[bot] merged 3 commits into openstack-k8s-operators:main from lmiccini:fix-must-gather-resilience on Jun 29, 2026

@lmiccini (Contributor)

Fix tar --one-top-level usage in gather_sos to use -C with a relative directory name, matching the fix already applied in gather_edpm_sos (a6cb8e1). The --one-top-level option expects a relative path, not an absolute one, which caused "must use a relative file name" errors.
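
For illustration, the corrected call shape might look like the sketch below. It assumes GNU tar; `extract_into` and its arguments are hypothetical names, not code taken from `gather_sos`.

```shell
# Hypothetical helper (not from the repository): unpack ARCHIVE so that its
# contents end up under DEST, where DEST may be an absolute path.
extract_into() {
    local archive=$1 dest=$2
    local parent name
    parent=$(dirname -- "$dest")
    name=$(basename -- "$dest")
    mkdir -p -- "$parent"
    # -C changes into the parent directory first, so --one-top-level only
    # ever receives a relative directory name.
    tar -xf "$archive" -C "$parent" --one-top-level="$name"
}
```

Splitting the destination into parent and base name keeps the caller free to pass an absolute path while tar itself only sees a relative one.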

Make SOS report decompression non-fatal: if tar fails, preserve the compressed archive instead of deleting it and leaving nothing.
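
A minimal sketch of the non-fatal behavior described above. `unpack_or_keep` is a hypothetical name, and removing the archive after a successful extraction is an assumption about the script, not something the description confirms.

```shell
# Hypothetical helper (not from the repository): unpack ARCHIVE into DEST.
# On success the archive is removed (assumed behavior); on failure it is kept.
unpack_or_keep() {
    local archive=$1 dest=$2
    mkdir -p -- "$dest"
    if tar -xf "$archive" -C "$dest" 2>/dev/null; then
        rm -f -- "$archive"
    else
        echo "WARNING: could not extract $archive, keeping the compressed file" >&2
        # Remove the destination only if it is still empty.
        rmdir -- "$dest" 2>/dev/null || true
    fi
    # Always report success so a caller running under `set -e` keeps going.
    return 0
}
```

The point of the design is that a failed extraction leaves the compressed report in place for later inspection.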

Filter out containers in waiting state from log collection in gather_ctlplane_resources. Containers that have never started have no logs, and oc logs errors on them (e.g. "container frr is not valid for pod speaker-*" in MetalLB with FRR-K8s mode).
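
One way such a filter could be written is sketched below. It assumes `jq` is available and that the pod JSON comes from `oc get pod ... -o json`; `containers_with_logs` is a hypothetical name and the actual change in `gather_ctlplane_resources` may filter differently.

```shell
# Hypothetical helper (not from the repository): read one Pod object as JSON
# on stdin and print the names of containers that are not in the waiting state.
# Assumes jq is installed.
containers_with_logs() {
    jq -r '.status.containerStatuses[]?
           | select(.state.waiting == null)
           | .name'
}

# Possible use ($pod and $ns are placeholders):
#   oc get pod "$pod" -n "$ns" -o json | containers_with_logs |
#       while read -r c; do oc logs "$pod" -c "$c" -n "$ns"; done
```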

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@openshift-ci bot requested review from bshewale and danpawlik on Jun 29, 2026, 11:03

tox 4.56.1 changed the default behavior for missing Python
interpreters from skip to fail, breaking CI when py36/py39 are
not available on the runner. Explicitly set skip_missing_interpreters
to restore the previous behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@stuggi mentioned this pull request on Jun 29, 2026
@stuggi requested a review from fmount on Jun 29, 2026, 12:58

The gather_sos script creates /var/tmp/sos-osp on the host (via chroot
/host), but then passes --tmp-dir=/var/tmp/sos-osp to sos report which
runs inside the toolbox container. The support-tools RUN label mounts
the host filesystem at /host, not /, so the directory is only accessible
at /host/var/tmp/sos-osp inside the container.

This has been broken since gather_sos was introduced: SOS reports have
never succeeded in CI because of this path mismatch.
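
The path mismatch described above can be sketched as follows. `HOST_ROOT` and `host_path_in_container` are hypothetical names used only for illustration; the actual fix in `gather_sos` may simply hard-code the prefixed path.

```shell
# Assumption: the host filesystem is mounted at HOST_ROOT inside the container
# (the commit message says /host for the support-tools image).
HOST_ROOT=${HOST_ROOT:-/host}

# Hypothetical helper: translate an absolute host path into the path under
# which the same directory is visible from inside the container.
host_path_in_container() {
    printf '%s/%s\n' "${HOST_ROOT%/}" "${1#/}"
}

# Possible use: create the directory on the host, then hand sos the
# container-side view of it.
#   chroot "$HOST_ROOT" mkdir -p /var/tmp/sos-osp
#   sos report --batch --tmp-dir "$(host_path_in_container /var/tmp/sos-osp)"
```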

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@fmount (Contributor) left a comment

/approve
/lgtm

@openshift-ci bot commented Jun 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fmount

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot bot merged commit 8679a70 into openstack-k8s-operators:main on Jun 29, 2026
4 checks passed