fix(keystone_auth): emit Node,RBAC,Webhook authorization-mode and wait for healthz by ricolin · Pull Request #1018 · vexxhost/magnum-cluster-api

Rico Lin (ricolin) · 2026-05-09T03:00:23Z

Summary

Fixes enable_keystone_auth=true on CAPI-driven Magnum clusters. Without this PR, clusters that opt in to Keystone webhook authorization either silently bypass the webhook (so Keystone tokens never authorize) or fail to bootstrap entirely because the kube-apiserver static pod crash-loops at startup.

The change is a single commit touching src/features/keystone_auth.rs:

Set --authorization-mode=Node,RBAC,Webhook via a JSON-Patch op against the kustomize input.
Strip the kubeadm-default --authorization-mode=Node,RBAC line from the kustomize output before writing the manifest back, so kubelet only ever sees a single --authorization-mode line.
Wait for the k8s-keystone-auth Pod to become Ready before declaring post-kubeadm complete.

Background and analysis

1. Authorization mode never reached the apiserver

The pre-existing keystone_auth feature kustomized /etc/kubernetes/manifests/kube-apiserver.yaml to add the keystone webhook authorizer, but did not adjust --authorization-mode. The running apiserver therefore had only the kubeadm-default Node,RBAC modes, never consulted the webhook, and every Keystone-token authenticated request failed with RBAC: forbidden.

2. Duplicate `--authorization-mode` line crashed the apiserver

kubeadm always writes --authorization-mode=Node,RBAC into its apiserver static-pod manifest. The feature's kustomization runs after kubeadm, so simply appending --authorization-mode=Node,RBAC,Webhook would leave the manifest carrying both lines:

- --authorization-mode=Node,RBAC          # written by kubeadm
- --authorization-mode=Node,RBAC,Webhook  # appended by us

--authorization-mode is registered with pflag.StringSliceVar, which appends the values of repeated flags instead of letting the last occurrence win. The apiserver process therefore receives ["Node","RBAC","Node","RBAC","Webhook"] and bails at startup with:

authorization-mode ["Node" "RBAC" "Node" "RBAC" "Webhook"]
  has mode specified more than once

kube-apiserver enters CrashLoopBackOff, the KubeadmControlPlane (KCP) never reaches APIServerAvailable=true, and the cluster eventually fails the NodeStartupTimeout window. This is the failure mode every enable_keystone_auth=true cluster hits without this PR.
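
To make the failure concrete, here is a small shell sketch of the append-then-validate behaviour described above. It only imitates the semantics; `merge_modes` and `duplicate_modes` are hypothetical helper names, not anything from kube-apiserver or this PR.

```shell
# Join the values of every --authorization-mode occurrence, the way an
# appending slice flag would, instead of keeping only the last one.
merge_modes() {
  merged=""
  for arg in "$@"; do
    case "$arg" in
      --authorization-mode=*)
        merged="${merged:+$merged,}${arg#--authorization-mode=}"
        ;;
    esac
  done
  printf '%s\n' "$merged"
}

# Print each mode that appears more than once in a comma-separated list.
duplicate_modes() {
  printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -d
}
```

Called with both flags from the manifest above, `merge_modes` prints `Node,RBAC,Node,RBAC,Webhook`, and `duplicate_modes` then reports `Node` and `RBAC`, which mirrors the modes the apiserver rejects.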

The fix here pipes the kustomize output through sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d' before writing the manifest, so the kubeadm-default line is removed and only the Node,RBAC,Webhook line remains.
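
A quick way to see what the filter does and does not delete is to feed it a few sample lines. The sed expression is the one quoted above; the wrapper name `strip_default_authz_mode` and the `Node,RBAC,Foo` sample are only illustrative.

```shell
# Drop only the exact kubeadm-default line, with any amount of indentation.
strip_default_authz_mode() {
  sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d'
}

printf '%s\n' \
  '    - --authorization-mode=Node,RBAC' \
  '    - --authorization-mode=Node,RBAC,Webhook' \
  '    - --authorization-mode=Node,RBAC,Foo' \
  | strip_default_authz_mode
# prints:
#     - --authorization-mode=Node,RBAC,Webhook
#     - --authorization-mode=Node,RBAC,Foo
```

Because the pattern is anchored with `^` and `$`, the longer values survive and only the bare `Node,RBAC` line is removed.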

3. Cluster bootstrap raced ahead of the keystone webhook deployment

The post-kubeadm command sequence proceeded as soon as the apiserver TCP port was open, but the k8s-keystone-auth Deployment was still rolling out. Webhook calls during that window returned connection refused, which the apiserver caches negatively and surfaces as authentication failures for several seconds after the webhook is actually up.

Fix: append kubectl wait --for=condition=Ready against the k8s-keystone-auth Pod to the post-kubeadm commands so the cluster does not register Ready until the webhook actually serves requests.
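
For reference, a wait of this kind could look roughly like the sketch below. The `kube-system` namespace and the `app=k8s-keystone-auth` label are taken from the validation output in this description; the function name and the 300s default timeout are assumptions, and the exact command emitted by the PR may differ.

```shell
# Block until the keystone webhook Pod reports Ready, or give up after the
# timeout. The default timeout here is an arbitrary illustrative value.
wait_for_keystone_webhook() {
  timeout="${1:-300s}"
  kubectl --namespace kube-system wait pod \
    --selector app=k8s-keystone-auth \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

One caveat worth keeping in mind: `kubectl wait` can return an error straight away when no Pod matches the selector yet, so a real post-kubeadm step may need a retry loop around it.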

Validation

End-to-end on a CAPI baremetal cluster created with enable_keystone_auth=true:

$ kubectl -n kube-system get pod -l component=kube-apiserver
NAME                          READY   STATUS    RESTARTS   AGE
kube-apiserver-<master>       1/1     Running   0          8m44s

$ ssh master 'grep authorization-mode /etc/kubernetes/manifests/kube-apiserver.yaml'
    - --authorization-mode=Node,RBAC,Webhook        # exactly one occurrence

$ kubectl -n kube-system get pod -l app=k8s-keystone-auth
NAME                      READY   STATUS    RESTARTS   AGE
k8s-keystone-auth-XXXXX   1/1     Running   0          7m59s

KubeadmControlPlane reports APIServerAvailable=true.
Cluster reaches CREATE_COMPLETE cleanly.
Keystone-token authenticated kubectl requests succeed.

Two unit tests in src/features/keystone_auth.rs cover the patch emission and the post-kubeadm command pipeline; both pass with cargo test.

Notes

The sed filter is anchored to the whole line, tolerates leading whitespace, and matches only the literal --authorization-mode=Node,RBAC value, so it cannot accidentally remove a future Node,RBAC,Foo value or any line that merely contains the substring.
No new label, no new operator-facing knob; the feature is a behaviour fix gated by the existing enable_keystone_auth=true label.

…t for healthz

Two independent issues prevented `enable_keystone_auth=true` from working end-to-end on baremetal CAPI clusters.

1. Authorization mode never reached the apiserver

The `keystone_auth` feature kustomized `/etc/kubernetes/manifests/kube-apiserver.yaml` to add the keystone webhook authorizer, but did not adjust `--authorization-mode`. The running apiserver therefore had only the kubeadm-default `Node,RBAC` modes, never consulted the webhook, and every Keystone-token authenticated request failed with `RBAC: forbidden`.

Fix: emit a JSON-Patch op that *replaces* (not appends) the apiserver `--authorization-mode` flag on the kustomize input. The replacement value is `Node,RBAC,Webhook`. Append-style ops are not used here because `--authorization-mode` is a `pflag.StringSliceVar` which appends repeated values rather than last-occurrence-wins, so a stray duplicate would be fatal; see point 2 below.

2. Duplicate `--authorization-mode` line crashed the apiserver

kubeadm always writes `--authorization-mode=Node,RBAC` into its apiserver static-pod manifest. The feature's kustomization runs AFTER kubeadm, so even after fixing point 1 above, the live manifest carries BOTH lines:

- --authorization-mode=Node,RBAC          # written by kubeadm
- --authorization-mode=Node,RBAC,Webhook  # added by us

`--authorization-mode` is a `pflag.StringSliceVar` which appends repeated values; the apiserver process therefore sees the merged list `["Node","RBAC","Node","RBAC","Webhook"]` and bails at startup with:

authorization-mode ["Node" "RBAC" "Node" "RBAC" "Webhook"]
  has mode specified more than once

`kube-apiserver` enters CrashLoopBackOff, the KubeadmControlPlane never reaches `APIServerAvailable=true`, and the cluster eventually fails the NodeStartupTimeout window.

Fix: pipe the kustomize output through `sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d'` before writing the manifest back. The kubeadm-default line is stripped before kubelet picks the file up; the kubelet only ever sees the de-duped manifest with a single `--authorization-mode=Node,RBAC,Webhook` line.

3. Cluster bootstrap raced ahead of the keystone webhook deployment

The post-kubeadm command sequence proceeded as soon as the apiserver TCP port was open, but the keystone webhook Deployment was still rolling out. Webhook calls during that window returned `connection refused`, which the apiserver caches negatively and surfaces as authentication failures for several seconds after the webhook is actually up.

Fix: append a `kubectl wait --for=condition=Ready` against the `k8s-keystone-auth` Pod (with a sensible timeout) to the post-kubeadm commands so the cluster is not marked Ready until the webhook actually serves requests.

Tests:

* Unit-test assertions in `src/features/keystone_auth.rs` updated to match the new sed-pipe postKubeadmCommand and the `--authorization-mode=Node,RBAC,Webhook` JSON-Patch.

End-to-end validation:

* Created an `enable_keystone_auth=true` cluster on a CAPI baremetal cluster.
* Verified `kube-apiserver` Pod stays Running with 0 restarts.
* `/etc/kubernetes/manifests/kube-apiserver.yaml` carries exactly one `--authorization-mode=Node,RBAC,Webhook` entry (no duplicate).
* `k8s-keystone-auth` Pod Running before cluster bootstrap completes.
* KubeadmControlPlane reports `APIServerAvailable=true`; cluster reaches CREATE_COMPLETE.
* Keystone-token authenticated `kubectl` requests succeed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>

Rico Lin (ricolin) · 2026-05-14T05:22:03Z

@codex[agent] review this PR

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54f4f3c7bd

chatgpt-codex-connector · 2026-05-14T05:25:05Z

+                        // *appends* repeated flags, so a duplicate would yield
+                        // `Node,RBAC,Node,RBAC,Webhook` and apiserver bails with
+                        // "authorization-mode ... has mode specified more than once".
+                        value: Some("kubectl kustomize /etc/kubernetes/keystone-kustomization | sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d' > /etc/kubernetes/manifests/kube-apiserver.yaml".into()),


Do not mask kustomize failures while rewriting the manifest

If kubectl kustomize fails here (for example because the generated kustomization is invalid, the copied resource is missing, or the binary errors), the pipeline still exits with sed's status and the shell redirection has already truncated /etc/kubernetes/manifests/kube-apiserver.yaml. Cloud-init then treats the rewrite as successful while kubelet sees an empty/partial static-pod manifest and removes or crashes the apiserver. Please preserve the upstream command failure, e.g. with set -o pipefail and writing through a temp file before replacing the live manifest.
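
One way to address this, sketched under assumptions: render to a temporary file, filter into a second one, and only then rename over the live manifest. This variant avoids the pipe altogether, so every step's exit status is checked even in shells without `set -o pipefail`. The function name, argument order, and scratch-directory idea are hypothetical and not taken from the PR.

```shell
# Render, filter, and install the manifest, stopping at the first failure.
# $1: kustomization directory
# $2: scratch directory on the same filesystem as the manifest, but outside
#     the static-pod directory so kubelet does not try to load temp files
# $3: path of the live manifest
rewrite_apiserver_manifest() {
  src_dir="$1"
  scratch_dir="$2"
  manifest="$3"

  rendered="$(mktemp "$scratch_dir/kube-apiserver.rendered.XXXXXX")" || return 1
  filtered="$(mktemp "$scratch_dir/kube-apiserver.filtered.XXXXXX")" \
    || { rm -f "$rendered"; return 1; }

  # No pipe: kustomize, sed, and the non-empty check each gate the next step.
  if kubectl kustomize "$src_dir" > "$rendered" \
    && sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d' "$rendered" > "$filtered" \
    && [ -s "$filtered" ]; then
    # A rename within one filesystem replaces the manifest in a single step.
    mv "$filtered" "$manifest"
    status=$?
  else
    status=1
  fi

  rm -f "$rendered" "$filtered"
  return "$status"
}
```

If `kubectl kustomize` fails, the function returns non-zero and the existing manifest is left untouched, so cloud-init would see the failure instead of a truncated file.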

Co-authored-by: ricolin <7250045+ricolin@users.noreply.github.com>

Codex · 2026-05-14T05:31:40Z

@codex[agent] review this PR

Reviewed and validated locally (pre-commit + cargo test). Pushed a small follow-up to clarify keystone_auth comments and the healthz wait command: ff39765.

Mohammed Naser (mnaser)

If we're going to touch this feature, I'd like us to leverage this:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#patches

Use kubeadm's native patch directory for the kube-apiserver static pod changes instead of rewriting the rendered manifest in post-kubeadm commands. This removes the kustomize pipeline that could mask failures and truncate the live manifest.

Assisted-By: ChatGPT <noreply@openai.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
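
As a rough illustration of the kubeadm patches mechanism linked above: kubeadm reads patch files named `target[suffix][+patchtype].extension` from a directory set through `patches.directory` (or the `--patches` flag) and applies them when it writes the static-pod manifests. The sketch below writes one strategic-merge patch for `kube-apiserver`. The function name, the volume name, and the `/etc/kubernetes/keystone-webhook` path are hypothetical, and how this PR encodes the `--authorization-mode` change is not shown in this thread.

```shell
# Write a strategic-merge patch that kubeadm would apply to the
# kube-apiserver static pod. All names and paths below are placeholders.
write_apiserver_patch() {
  patch_dir="$1"
  mkdir -p "$patch_dir" || return 1
  # File name: target "kube-apiserver", suffix "0", patch type "strategic".
  cat > "$patch_dir/kube-apiserver0+strategic.yaml" <<'EOF'
spec:
  containers:
  - name: kube-apiserver
    volumeMounts:
    - name: keystone-webhook-config
      mountPath: /etc/kubernetes/keystone-webhook
      readOnly: true
  volumes:
  - name: keystone-webhook-config
    hostPath:
      path: /etc/kubernetes/keystone-webhook
      type: DirectoryOrCreate
EOF
}
```

Since kubeadm applies the patch while generating the manifest, there is no later rewrite of the live file and therefore no window in which kubelet could see a truncated one.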

Rico Lin (ricolin) · 2026-05-25T05:28:59Z

recheck

Rico Lin (ricolin) · 2026-05-25T05:31:39Z

recheck

The webhook config points kube-apiserver at 127.0.0.1:8443, so the k8s-keystone-auth DaemonSet must run on each control-plane node. Newer kubeadm clusters label control planes with node-role.kubernetes.io/control-plane instead of the legacy master label, so the old nodeSelector could leave no webhook listener on the API server host.

Use node affinity to accept either control-plane or master labels and add TCP probes for the HTTPS listener so readiness reflects the local webhook endpoint.

Assisted-By: Codex <noreply@openai.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>

Rico Lin (ricolin) requested a review from Mohammed Naser (mnaser) May 14, 2026 05:21

Codex started work on behalf of Rico Lin (ricolin) May 14, 2026 05:22

chatgpt-codex-connector Bot reviewed May 14, 2026

chore(keystone_auth): clarify comments and healthz wait

bd1ff89

Co-authored-by: ricolin <7250045+ricolin@users.noreply.github.com>

Codex finished work on behalf of Rico Lin (ricolin) May 14, 2026 05:32

Mohammed Naser (mnaser) requested changes May 18, 2026

Rico Lin (ricolin) mentioned this pull request May 28, 2026

fix(keystone-auth): schedule webhook on control planes #1052

Merged

Rico Lin (ricolin) force-pushed the keystone-auth-webhook-fallback branch 2 times, most recently from 890d245 to c727b88 on June 16, 2026 00:41

Rico Lin (ricolin) wants to merge 4 commits into main from keystone-auth-webhook-fallback

3 participants
