
fix(keystone_auth): emit Node,RBAC,Webhook authorization-mode and wait for healthz #1018

Open
Rico Lin (ricolin) wants to merge 4 commits into main from keystone-auth-webhook-fallback

Conversation

@ricolin

Member

Summary

Fixes enable_keystone_auth=true on CAPI-driven Magnum clusters. Without this PR, clusters that opt in to Keystone webhook authorization either silently bypass the webhook (so Keystone tokens never authorize) or fail to bootstrap entirely because the kube-apiserver static pod crash-loops at startup.

The change is a single commit touching src/features/keystone_auth.rs:

  • Set --authorization-mode=Node,RBAC,Webhook via a JSON-Patch op against the kustomize input.
  • Strip the kubeadm-default --authorization-mode=Node,RBAC line from the kustomize output before writing the manifest back, so kubelet only ever sees a single --authorization-mode line.
  • Wait for the k8s-keystone-auth Pod to become Ready before declaring post-kubeadm complete.

Background and analysis

1. Authorization mode never reached the apiserver

The pre-existing keystone_auth feature kustomized /etc/kubernetes/manifests/kube-apiserver.yaml to add the keystone webhook authorizer, but did not adjust --authorization-mode. The running apiserver therefore had only the kubeadm-default Node,RBAC modes, never consulted the webhook, and every Keystone-token authenticated request failed with RBAC: forbidden.

2. Duplicate --authorization-mode line crashed the apiserver

kubeadm always writes --authorization-mode=Node,RBAC into its apiserver static-pod manifest. The feature's kustomization runs after kubeadm, so simply appending --authorization-mode=Node,RBAC,Webhook would leave the manifest carrying both lines:

- --authorization-mode=Node,RBAC          # written by kubeadm
- --authorization-mode=Node,RBAC,Webhook  # appended by us

--authorization-mode is a pflag.StringSliceVar, which appends repeated flag values instead of letting the last occurrence win. The apiserver process therefore receives ["Node","RBAC","Node","RBAC","Webhook"] and exits at startup with:

authorization-mode ["Node" "RBAC" "Node" "RBAC" "Webhook"]
  has mode specified more than once

kube-apiserver enters CrashLoopBackOff, KCP never reaches APIServerAvailable=true, and the cluster fails NodeStartupTimeout. This is the failure mode every enable_keystone_auth=true cluster hits without this PR.
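
The append behaviour described above can be modelled in a few lines of Rust. This is only an illustrative sketch, not kube-apiserver or pflag code: `merge_slice_flag` and `first_duplicate` are hypothetical helpers that mimic how repeated comma-separated flag values accumulate and how a duplicate mode would then be detected.

```rust
use std::collections::HashSet;

/// Collect every value given for `flag` across `args`, splitting on commas
/// and appending in order, the way a string-slice flag accumulates repeats.
/// (Hypothetical helper; only the `--flag=value` form is handled.)
fn merge_slice_flag(args: &[&str], flag: &str) -> Vec<String> {
    let prefix = format!("--{}=", flag);
    args.iter()
        .filter_map(|arg| arg.strip_prefix(prefix.as_str()))
        .flat_map(|value| value.split(','))
        .map(str::to_string)
        .collect()
}

/// Return the first mode that appears more than once, if any.
fn first_duplicate(modes: &[String]) -> Option<&str> {
    let mut seen = HashSet::new();
    modes
        .iter()
        .map(String::as_str)
        .find(|mode| !seen.insert(*mode))
}
```

Feeding both manifest lines into `merge_slice_flag` yields the five-element list quoted in the error message, and `first_duplicate` reports `Node`; with only the `Node,RBAC,Webhook` line present, no duplicate is found.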

The fix here pipes the kustomize output through sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d' before writing the manifest, so the kubeadm-default line is removed and only the Node,RBAC,Webhook line remains.
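
For readers who want to check what the sed expression keeps and drops, here is a rough Rust equivalent. It is a sketch under assumptions, not code from this PR: `strip_kubeadm_default_mode` is a hypothetical name, and `trim_start` and `lines` only approximate the POSIX `[[:space:]]` class and sed's line handling (for example, a trailing carriage return is treated differently).

```rust
/// Drop lines that consist solely of optional leading whitespace followed by
/// `- --authorization-mode=Node,RBAC`; every other line is kept unchanged.
fn strip_kubeadm_default_mode(manifest: &str) -> String {
    // The exact kubeadm-default entry the filter is meant to remove.
    const DEFAULT_LINE: &str = "- --authorization-mode=Node,RBAC";
    manifest
        .lines()
        // Whole-line comparison: a longer value such as Node,RBAC,Webhook
        // is not equal to DEFAULT_LINE and therefore survives.
        .filter(|line| line.trim_start() != DEFAULT_LINE)
        .map(|line| format!("{}\n", line))
        .collect()
}
```

Because the comparison is against the whole trimmed line, the `Node,RBAC,Webhook` entry and any line with other leading text (such as a commented-out copy) pass through untouched.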

3. Cluster bootstrap raced ahead of the keystone webhook deployment

The post-kubeadm command sequence proceeded as soon as the apiserver TCP port was open, but the k8s-keystone-auth Deployment was still rolling out. Webhook calls during that window returned connection refused, which the apiserver caches negatively and surfaces as authentication failures for several seconds after the webhook is actually up.

Fix: append kubectl wait --for=condition=Ready against the k8s-keystone-auth Pod to the post-kubeadm commands so the cluster does not register Ready until the webhook actually serves requests.
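
The wait-with-timeout semantics that this step relies on can be pictured as a simple polling loop. The Rust sketch below is a generic model only: `wait_until_ready`, its parameters, and the polling interval are hypothetical and do not reflect how `kubectl wait` is implemented or which timeout the PR uses.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll `is_ready` every `interval` until it returns true or `timeout`
/// elapses. Returns true if readiness was observed before the deadline.
fn wait_until_ready<F>(mut is_ready: F, timeout: Duration, interval: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        // Check first so an already-ready target returns immediately.
        if is_ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

The point of gating bootstrap on such a check is that the caller sees either a confirmed Ready state or an explicit timeout, rather than proceeding on the assumption that an open apiserver port implies a working webhook.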

Validation

End-to-end on a CAPI baremetal cluster created with enable_keystone_auth=true:

$ kubectl -n kube-system get pod -l component=kube-apiserver
NAME                          READY   STATUS    RESTARTS   AGE
kube-apiserver-<master>       1/1     Running   0          8m44s

$ ssh master 'grep authorization-mode /etc/kubernetes/manifests/kube-apiserver.yaml'
    - --authorization-mode=Node,RBAC,Webhook        # exactly one occurrence

$ kubectl -n kube-system get pod -l app=k8s-keystone-auth
NAME                      READY   STATUS    RESTARTS   AGE
k8s-keystone-auth-XXXXX   1/1     Running   0          7m59s

  • KubeadmControlPlane reports APIServerAvailable=true.
  • Cluster reaches CREATE_COMPLETE cleanly.
  • Keystone-token authenticated kubectl requests succeed.

Two unit tests in src/features/keystone_auth.rs cover the patch emission and the post-kubeadm command pipeline; both pass with cargo test.

Notes

  • The sed filter is anchored to the whole line, tolerates leading whitespace, and matches only the literal --authorization-mode=Node,RBAC value, so it cannot accidentally remove a future Node,RBAC,Foo value or any line that merely contains the substring.
  • No new label, no new operator-facing knob; the feature is a behaviour fix gated by the existing enable_keystone_auth=true label.

…t for healthz

Two independent issues prevented `enable_keystone_auth=true` from
working end-to-end on baremetal CAPI clusters.

1. Authorization mode never reached the apiserver

   The `keystone_auth` feature kustomized
   `/etc/kubernetes/manifests/kube-apiserver.yaml` to add the keystone
   webhook authorizer, but did not adjust `--authorization-mode`. The
   running apiserver therefore had only the kubeadm-default
   `Node,RBAC` modes, never consulted the webhook, and every Keystone-
   token authenticated request failed with `RBAC: forbidden`.

   Fix: emit a JSON-Patch op that *replaces* (not appends) the
   apiserver `--authorization-mode` flag on the kustomize input.
   The replacement value is `Node,RBAC,Webhook`. Append-style ops
   are not used here because `--authorization-mode` is a
   `pflag.StringSliceVar` which appends repeated values rather than
   last-occurrence-wins, so a stray duplicate would be fatal — see
   point 2 below.

2. Duplicate `--authorization-mode` line crashed the apiserver

   kubeadm always writes `--authorization-mode=Node,RBAC` into its
   apiserver static-pod manifest. The feature's kustomization runs
   AFTER kubeadm, so even after fixing point 1 above, the live
   manifest carries BOTH lines:

     - --authorization-mode=Node,RBAC          # written by kubeadm
     - --authorization-mode=Node,RBAC,Webhook  # added by us

   `--authorization-mode` is a `pflag.StringSliceVar` which appends
   repeated values; the apiserver process therefore sees the merged
   list `["Node","RBAC","Node","RBAC","Webhook"]` and bails at
   startup with:

     authorization-mode ["Node" "RBAC" "Node" "RBAC" "Webhook"]
       has mode specified more than once

   `kube-apiserver` enters CrashLoopBackOff, the KubeadmControlPlane
   never reaches `APIServerAvailable=true`, and the cluster
   eventually fails the NodeStartupTimeout window.

   Fix: pipe the kustomize output through
   `sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d'`
   before writing the manifest back. The kubeadm-default line is
   stripped before kubelet picks the file up; the kubelet only ever
   sees the de-duped manifest with a single
   `--authorization-mode=Node,RBAC,Webhook` line.

3. Cluster bootstrap raced ahead of the keystone webhook deployment

   The post-kubeadm command sequence proceeded as soon as the
   apiserver TCP port was open, but the keystone webhook Deployment
   was still rolling out. Webhook calls during that window returned
   `connection refused`, which the apiserver caches negatively and
   surfaces as authentication failures for several seconds after the
   webhook is actually up.

   Fix: append a `kubectl wait --for=condition=Ready` against the
   `k8s-keystone-auth` Pod (with a sensible timeout) to the
   post-kubeadm commands so the cluster is not marked Ready until the
   webhook actually serves requests.

Tests:

  * Unit-test assertions in `src/features/keystone_auth.rs` updated to
    match the new sed-pipe postKubeadmCommand and the
    `--authorization-mode=Node,RBAC,Webhook` JSON-Patch.

End-to-end validation:

  * Created an `enable_keystone_auth=true` cluster on CAPI baremetal.
  * Verified `kube-apiserver` Pod stays Running with 0 restarts.
  * `/etc/kubernetes/manifests/kube-apiserver.yaml` carries exactly
    one `--authorization-mode=Node,RBAC,Webhook` entry (no duplicate).
  * `k8s-keystone-auth` Pod Running before cluster bootstrap completes.
  * KubeadmControlPlane reports `APIServerAvailable=true`; cluster
    reaches CREATE_COMPLETE.
  * Keystone-token authenticated `kubectl` requests succeed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
@ricolin

Member Author

@codex[agent] review this PR

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54f4f3c7bd


Comment thread src/features/keystone_auth.rs Outdated
// *appends* repeated flags, so a duplicate would yield
// `Node,RBAC,Node,RBAC,Webhook` and apiserver bails with
// "authorization-mode ... has mode specified more than once".
value: Some("kubectl kustomize /etc/kubernetes/keystone-kustomization | sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d' > /etc/kubernetes/manifests/kube-apiserver.yaml".into()),


[P1] Do not mask kustomize failures while rewriting the manifest

If kubectl kustomize fails here (for example because the generated kustomization is invalid, the copied resource is missing, or the binary errors), the pipeline still exits with sed's status and the shell redirection has already truncated /etc/kubernetes/manifests/kube-apiserver.yaml. Cloud-init then treats the rewrite as successful while kubelet sees an empty/partial static-pod manifest and removes or crashes the apiserver. Please preserve the upstream command failure, e.g. with set -o pipefail and writing through a temp file before replacing the live manifest.
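
The render-then-swap pattern suggested here can be sketched in Rust as follows. This is an illustration of the general technique, not the change made in this PR: `replace_manifest` is a hypothetical helper, the `render` closure stands in for the kustomize step, and the caller is assumed to pick a `tmp` path on the same filesystem as `target` so the rename replaces the file in one step.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Render new manifest content and swap it into place only if rendering
/// succeeds. The content goes to `tmp` first and is then renamed over
/// `target`, so `target` is never left truncated or half-written.
fn replace_manifest<R>(target: &Path, tmp: &Path, render: R) -> io::Result<()>
where
    R: FnOnce() -> io::Result<String>,
{
    // Run the renderer first; on error, return before touching any file.
    let content = render()?;
    // Write the complete content to the temporary path.
    fs::write(tmp, content)?;
    // Replace the live file with the fully written temporary file.
    fs::rename(tmp, target)
}
```

If `render` fails, the error propagates to the caller and the existing manifest stays exactly as it was, which is the property the shell pipeline with a bare redirection lacks.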


Co-authored-by: ricolin <7250045+ricolin@users.noreply.github.com>
@Codex

Codex AI commented May 14, 2026


@codex[agent] review this PR

Reviewed and validated locally (pre-commit + cargo test). Pushed a small follow-up to clarify keystone_auth comments and the healthz wait command: ff39765.

@mnaser Mohammed Naser (mnaser) left a comment

Member


If we're going to touch this feature, I'd like us to leverage this:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#patches

Use kubeadm's native patch directory for the kube-apiserver static pod
changes instead of rewriting the rendered manifest in post-kubeadm
commands. This removes the kustomize pipeline that could mask failures
and truncate the live manifest.

Assisted-By: ChatGPT <noreply@openai.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
@ricolin

Member Author

recheck

@ricolin

Member Author

recheck

The webhook config points kube-apiserver at 127.0.0.1:8443, so the
k8s-keystone-auth DaemonSet must run on each control-plane node. Newer
kubeadm clusters label control planes with node-role.kubernetes.io/control-plane
instead of the legacy master label, so the old nodeSelector could leave no
webhook listener on the API server host.

Use node affinity to accept either control-plane or master labels and add TCP
probes for the HTTPS listener so readiness reflects the local webhook endpoint.

Assisted-By: Codex <noreply@openai.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
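
The two ideas in the commit message above, accepting either control-plane label and probing the local listener over TCP, can be modelled in Rust as follows. This is a sketch under assumptions rather than the manifest change itself: `is_control_plane` and `tcp_probe` are hypothetical helpers, and `node-role.kubernetes.io/master` is assumed to be the legacy label the message refers to.

```rust
use std::collections::BTreeMap;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Label keys that mark a node as control plane (newer key first, then the
/// assumed legacy key).
const CONTROL_PLANE_LABELS: [&str; 2] = [
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",
];

/// True if the node carries either control-plane label key, regardless of
/// the label value.
fn is_control_plane(labels: &BTreeMap<String, String>) -> bool {
    CONTROL_PLANE_LABELS
        .iter()
        .any(|key| labels.contains_key(*key))
}

/// True if a TCP connection to `addr` can be opened within `timeout`.
/// This only shows that something is listening, not that TLS or the
/// webhook logic behind the port is healthy.
fn tcp_probe(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```

A TCP-level check is the weakest useful readiness signal for an HTTPS listener, but it is enough to distinguish "no webhook on this host" from "webhook accepting connections", which is the gap the old nodeSelector could leave open.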
@ricolin Rico Lin (ricolin) force-pushed the keystone-auth-webhook-fallback branch 2 times, most recently from 890d245 to c727b88 Compare June 16, 2026 00:41