feat(driver): aggregate Machine.Status.Conditions into cluster status_reason #1020
Rico Lin (ricolin) wants to merge 4 commits into
…_reason
When a CAPI Machine fails (e.g. CAPO surfaces FloatingIPErrorReason on APIServerIngressReady=False) the parent OpenStackCluster Status can stay Ready=true while the cluster is effectively broken. The Magnum poller's existing aggregator only walked CAPI Cluster + OpenStackCluster events, so operators saw an UNHEALTHY cluster with an empty status_reason and had to SSH into the management cluster to debug.
Add a Machine.Status.Conditions walk that aggregates any False condition with a Reason into the existing status_reason string. Fail open (return empty) on listing errors so a missing CRD or RBAC issue cannot regress the existing event-based reason.
Refs: bm-public-net.fullworkflow.md §13.14 (item 14-L2).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
2d368e0 to ca491ff
@codex[agent] review this PR
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 76a8fd15ba
        return ""

    messages = []
    for machine in machines:
Broaden the fail-open guard around Machine iteration
When the Machine list cannot be read because the CRD is absent, RBAC denies it, or the API call times out, pykube raises while evaluating the lazy query during iteration, not necessarily when objects(...).filter(...) is constructed. Since this for loop is outside the try, those real listing failures can now abort update_cluster_status/delete polling instead of preserving the existing event-based status reason as intended; materialize or iterate the query inside the guarded block.
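A minimal sketch of the fix this comment asks for, assuming nothing about the PR's actual code: `lazy_machine_query` and `list_machines_fail_open` are hypothetical names, and the generator only stands in for a lazily evaluated client query (it is not the pykube API). It shows that the error surfaces when the query is consumed, so forcing evaluation with `list()` inside the `try` keeps the failure within the fail-open guard.

```python
from typing import Callable, Iterable, Iterator, List


def lazy_machine_query() -> Iterator[dict]:
    """Hypothetical stand-in for a lazy query: calling it fetches nothing,
    and the listing error is raised only once iteration starts."""
    raise RuntimeError("machines.cluster.x-k8s.io is forbidden")
    yield {}  # unreachable; its presence makes this function a generator


def list_machines_fail_open(
    query_factory: Callable[[], Iterable[dict]],
) -> List[dict]:
    """Build and fully consume the query inside one guarded block."""
    try:
        # list() forces evaluation here, so listing errors are caught below
        # instead of escaping from a later for loop.
        return list(query_factory())
    except Exception:
        # Fail open: an unreadable Machine list contributes nothing.
        return []


# Constructing the query does not raise; only consuming it does.
query = lazy_machine_query()

machines = list_machines_fail_open(lazy_machine_query)
messages = []
for machine in machines:  # safe: machines is already a plain list
    messages.append(machine["name"])
```

With the list materialized up front, the loop that builds `messages` can stay outside the `try` without being able to abort the poller on a listing failure.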
Co-authored-by: ricolin <7250045+ricolin@users.noreply.github.com>
Reviewed and tightened the fail-open behavior:
fb6406a to 231a9ac
Problem
When a CAPI Machine fails (e.g. CAPO reports FloatingIPErrorReason on a Machine's APIServerIngressReady=False), the parent OpenStackCluster.Status can stay Ready=true while the cluster is effectively broken. The Magnum poller's existing status_reason aggregator only walks CAPI Cluster + OpenStackCluster events, so operators see an UNHEALTHY cluster with an empty status_reason and have to SSH into the management cluster to discover the actual failure.
Solution
Walk Machine.Status.Conditions for every Machine in the cluster's namespace and aggregate any False condition with a non-empty Reason into the existing status_reason string. APIServerIngressReady, Ready, BootstrapReady, InfrastructureReady, NodeHealthy: all surface here.
Changes
- magnum_cluster_api/driver.py: new _machine_conditions_reason(...) helper, called from the existing _status_reason() aggregator.
- magnum_cluster_api/tests/unit/test_driver.py: 4 new unit tests covering no machines, machine with no conditions, machine with False conditions (aggregated), and machine list error (returns empty).
Tests
pytest magnum_cluster_api/tests/unit/test_driver.py: all 12 driver tests pass.
Validation
End-to-end: created a cluster with a deliberately misconfigured floating-IP pool. CAPO marked the master Machine APIServerIngressReady=False with Reason=FloatingIPErrorReason. openstack coe cluster show then displayed:
Operator can now diagnose the failure without leaving the OpenStack CLI.
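The condition walk described under Solution could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `machine_conditions_reason` is a hypothetical function operating on plain dicts shaped like a Machine object, and the `Machine <name>: <type>=<reason>` message format, the `"; "` separator, and the machine name in the example are all invented here.

```python
from typing import Iterable, List, Mapping


def machine_conditions_reason(machines: Iterable[Mapping]) -> str:
    """Join every False condition that carries a Reason into one string.

    Hypothetical helper; the output format is an assumption.
    """
    messages: List[str] = []
    for machine in machines:
        name = machine.get("metadata", {}).get("name", "<unknown>")
        # A Machine that has not reported status yet has no conditions.
        conditions = machine.get("status", {}).get("conditions") or []
        for cond in conditions:
            # Condition status is a string: "True", "False" or "Unknown".
            if cond.get("status") != "False":
                continue
            reason = cond.get("reason")
            if not reason:
                continue  # skip False conditions without a Reason
            messages.append(f"Machine {name}: {cond.get('type')}={reason}")
    return "; ".join(messages)


# Example input mirroring the Validation scenario (name is made up).
example_machines = [
    {
        "metadata": {"name": "example-control-plane-0"},
        "status": {
            "conditions": [
                {"type": "Ready", "status": "True"},
                {
                    "type": "APIServerIngressReady",
                    "status": "False",
                    "reason": "FloatingIPErrorReason",
                },
            ]
        },
    }
]
reason = machine_conditions_reason(example_machines)
```

An empty Machine list, or Machines without conditions, yields an empty string, so the existing event-based reason is left unchanged in those cases.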