Skip to content

feat(driver): aggregate Machine.Status.Conditions into cluster status_reason#1020

Open
Rico Lin (ricolin) wants to merge 4 commits into
mainfrom
fix-driver-aggregate-machine-conditions
Open

feat(driver): aggregate Machine.Status.Conditions into cluster status_reason#1020
Rico Lin (ricolin) wants to merge 4 commits into
mainfrom
fix-driver-aggregate-machine-conditions

Conversation

@ricolin

Copy link
Copy Markdown
Member

Problem

When a CAPI Machine fails — e.g. CAPO reports FloatingIPErrorReason on a Machine's APIServerIngressReady=False — the parent OpenStackCluster.Status can stay Ready=true while the cluster is effectively broken. The Magnum poller's existing status_reason aggregator only walks CAPI Cluster + OpenStackCluster events, so operators see an UNHEALTHY cluster with an empty status_reason and have to SSH into the management cluster to discover the actual failure.

Solution

Walk Machine.Status.Conditions for every Machine in the cluster's namespace and aggregate any False condition with a non-empty Reason into the existing status_reason string.

  • Failure-mode coverage: per-Machine APIServerIngressReady, Ready, BootstrapReady, InfrastructureReady, NodeHealthy — all surface here.
  • Fail-open: if listing Machines errors (missing CRD, RBAC denial, transient API failure), the helper returns an empty string so the existing event-based reason is unchanged. This guarantees the patch can never regress the current behaviour.

Changes

  • magnum_cluster_api/driver.py — new _machine_conditions_reason(...) helper, called from the existing _status_reason() aggregator.
  • magnum_cluster_api/tests/unit/test_driver.py — 4 new unit tests covering: no machines, machine with no conditions, machine with False conditions (aggregated), machine list error (returns empty).

Tests

pytest magnum_cluster_api/tests/unit/test_driver.py — all 12 driver tests pass.

Validation

End-to-end: created a cluster with a deliberately misconfigured floating-IP pool. CAPO marked the master Machine APIServerIngressReady=False with Reason=FloatingIPErrorReason. openstack coe cluster show then displayed:

status_reason: ... FloatingIPErrorReason: <CAPO message> ...

Operator can now diagnose the failure without leaving the OpenStack CLI.

Rico Lin (ricolin) and others added 2 commits May 9, 2026 16:34
…_reason

When a CAPI Machine fails (e.g. CAPO surfaces FloatingIPErrorReason on
APIServerIngressReady=False) the parent OpenStackCluster Status can stay
Ready=true while the cluster is effectively broken. The Magnum poller's
existing aggregator only walked CAPI Cluster + OpenStackCluster events,
so operators saw an UNHEALTHY cluster with an empty status_reason and
had to SSH into the management cluster to debug.

Add a Machine.Status.Conditions walk that aggregates any False condition
with a Reason into the existing status_reason string. Fail open (return
empty) on listing errors so a missing CRD or RBAC issue cannot regress
the existing event-based reason.

Refs: bm-public-net.fullworkflow.md §13.14 (item 14-L2).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
@ricolin Rico Lin (ricolin) force-pushed the fix-driver-aggregate-machine-conditions branch from 2d368e0 to ca491ff Compare May 9, 2026 08:34
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
@ricolin

Copy link
Copy Markdown
Member Author

@codex[agent] review this PR

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76a8fd15ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread magnum_cluster_api/driver.py Outdated
return ""

messages = []
for machine in machines:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Broaden the fail-open guard around Machine iteration

When the Machine list cannot be read because the CRD is absent, RBAC denies it, or the API call times out, pykube raises while evaluating the lazy query during iteration, not necessarily when objects(...).filter(...) is constructed. Since this for loop is outside the try, those real listing failures can now abort update_cluster_status/delete polling instead of preserving the existing event-based status reason as intended; materialize or iterate the query inside the guarded block.

Useful? React with 👍 / 👎.

Co-authored-by: ricolin <7250045+ricolin@users.noreply.github.com>
@Codex

Codex AI commented May 14, 2026

Copy link
Copy Markdown

@codex[agent] review this PR

Reviewed and tightened the fail-open behavior: _get_machine_conditions_reason now returns empty on both list and iteration errors, matching the PR’s “missing CRD/RBAC/etc won’t regress” claim. Added a unit test for iteration-time failures. Commit: fb6406a.

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@ricolin Rico Lin (ricolin) force-pushed the fix-driver-aggregate-machine-conditions branch from fb6406a to 231a9ac Compare June 16, 2026 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants