fix(keystone_auth): emit Node,RBAC,Webhook authorization-mode and wait for healthz #1018
Rico Lin (ricolin) wants to merge 4 commits into
Two independent issues prevented `enable_keystone_auth=true` from
working end-to-end on baremetal CAPI clusters.
1. Authorization mode never reached the apiserver
The `keystone_auth` feature kustomized
`/etc/kubernetes/manifests/kube-apiserver.yaml` to add the keystone
webhook authorizer, but did not adjust `--authorization-mode`. The
running apiserver therefore had only the kubeadm-default
`Node,RBAC` modes, never consulted the webhook, and every Keystone-
token authenticated request failed with `RBAC: forbidden`.
Fix: emit a JSON-Patch op that *replaces* (not appends) the
apiserver `--authorization-mode` flag on the kustomize input.
The replacement value is `Node,RBAC,Webhook`. Append-style ops
are not used here because `--authorization-mode` is a
`pflag.StringSliceVar` which appends repeated values rather than
last-occurrence-wins, so a stray duplicate would be fatal — see
point 2 below.
2. Duplicate `--authorization-mode` line crashed the apiserver
kubeadm always writes `--authorization-mode=Node,RBAC` into its
apiserver static-pod manifest. The feature's kustomization runs
AFTER kubeadm, so even after fixing point 1 above, the live
manifest carries BOTH lines:
      - --authorization-mode=Node,RBAC          # written by kubeadm
      - --authorization-mode=Node,RBAC,Webhook  # added by us
`--authorization-mode` is a `pflag.StringSliceVar` which appends
repeated values; the apiserver process therefore sees the merged
list `["Node","RBAC","Node","RBAC","Webhook"]` and bails at
startup with:
      authorization-mode ["Node" "RBAC" "Node" "RBAC" "Webhook"]
      has mode specified more than once
`kube-apiserver` enters CrashLoopBackOff, the KubeadmControlPlane
never reaches `APIServerAvailable=true`, and the cluster
eventually fails the NodeStartupTimeout window.
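The append behaviour described above can be illustrated with a small shell sketch. It does not call kube-apiserver or pflag; `merge_modes` and `has_duplicate_mode` are hypothetical helpers that only mimic how repeated flag values are concatenated and then checked for duplicates.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: they imitate an appending slice flag, not the real apiserver code.

# Concatenate the values of every --authorization-mode argument, in order.
merge_modes() {
  local merged="" arg
  for arg in "$@"; do
    case "$arg" in
      --authorization-mode=*)
        merged="${merged:+$merged,}${arg#--authorization-mode=}"
        ;;
    esac
  done
  printf '%s\n' "$merged"
}

# Succeed (exit 0) when any mode occurs more than once in a comma-separated list.
has_duplicate_mode() {
  printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -d | grep -q .
}

merged="$(merge_modes --authorization-mode=Node,RBAC --authorization-mode=Node,RBAC,Webhook)"
echo "$merged"
if has_duplicate_mode "$merged"; then
  echo "duplicate mode detected"
fi
```

With the two lines from the manifest as input, the merged list contains `Node` and `RBAC` twice, which is the condition the apiserver rejects at startup.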
Fix: pipe the kustomize output through
`sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d'`
before writing the manifest back. The kubeadm-default line is
stripped before kubelet picks the file up; the kubelet only ever
sees the de-duped manifest with a single
`--authorization-mode=Node,RBAC,Webhook` line.
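As a quick check of the filter, the following sketch runs the same `sed` expression over a three-line sample. The sample is an illustrative fragment, not a full `kube-apiserver.yaml`, and `strip_default_authz_mode` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Drop only the exact kubeadm-default flag line; longer values are left alone.
strip_default_authz_mode() {
  sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d'
}

# Illustrative fragment of a static-pod manifest (assumed layout).
sample='    - kube-apiserver
    - --authorization-mode=Node,RBAC
    - --authorization-mode=Node,RBAC,Webhook'

printf '%s\n' "$sample" | strip_default_authz_mode
```

Because the pattern is anchored with `$` directly after `RBAC`, a line carrying trailing whitespace or a trailing comment would not match and would be left in place.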
3. Cluster bootstrap raced ahead of the keystone webhook deployment
The post-kubeadm command sequence proceeded as soon as the
apiserver TCP port was open, but the keystone webhook Deployment
was still rolling out. Webhook calls during that window returned
`connection refused`, which the apiserver caches negatively and
surfaces as authentication failures for several seconds after the
webhook is actually up.
Fix: append a `kubectl wait --for=condition=Ready` against the
`k8s-keystone-auth` Pod (with a sensible timeout) to the
post-kubeadm commands so the cluster is not marked Ready until the
webhook actually serves requests.
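A rough sketch of the wait step follows. `kubectl wait` has its own `--timeout`, so the generic `wait_until` loop below (a hypothetical helper, not part of this change) is only useful for checks that lack one, such as polling `/healthz`. The namespace and label selector in the commented commands are assumptions.

```shell
#!/usr/bin/env bash
# Retry a command once per second until it succeeds or the timeout (seconds) expires.
wait_until() {
  local timeout="$1"; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Example wiring (not executed here). The namespace and label selector are
# assumptions; adjust them to however the webhook Pod is actually labelled.
# kubectl wait --for=condition=Ready pod \
#   -n kube-system -l k8s-app=k8s-keystone-auth --timeout=300s
# wait_until 300 kubectl get --raw=/healthz
```

Either form returns non-zero on timeout, so the post-kubeadm sequence can fail visibly instead of continuing against a webhook that is not yet serving.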
Tests:
* Unit-test assertions in `src/features/keystone_auth.rs` updated to
match the new sed-pipe postKubeadmCommand and the
`--authorization-mode=Node,RBAC,Webhook` JSON-Patch.
End-to-end validation:
* Created an `enable_keystone_auth=true` cluster on a CAPI baremetal
cluster.
* Verified `kube-apiserver` Pod stays Running with 0 restarts.
* `/etc/kubernetes/manifests/kube-apiserver.yaml` carries exactly
one `--authorization-mode=Node,RBAC,Webhook` entry (no duplicate).
* `k8s-keystone-auth` Pod Running before cluster bootstrap completes.
* KubeadmControlPlane reports `APIServerAvailable=true`; cluster
reaches CREATE_COMPLETE.
* Keystone-token authenticated `kubectl` requests succeed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
@codex[agent] review this PR
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54f4f3c7bd
// *appends* repeated flags, so a duplicate would yield
// `Node,RBAC,Node,RBAC,Webhook` and apiserver bails with
// "authorization-mode ... has mode specified more than once".
value: Some("kubectl kustomize /etc/kubernetes/keystone-kustomization | sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d' > /etc/kubernetes/manifests/kube-apiserver.yaml".into()),
Do not mask kustomize failures while rewriting the manifest
If `kubectl kustomize` fails here (for example because the generated kustomization is invalid, the copied resource is missing, or the binary errors), the pipeline still exits with `sed`'s status and the shell redirection has already truncated `/etc/kubernetes/manifests/kube-apiserver.yaml`. Cloud-init then treats the rewrite as successful while kubelet sees an empty/partial static-pod manifest and removes or crashes the apiserver. Please preserve the upstream command failure, e.g. with `set -o pipefail` and writing through a temp file before replacing the live manifest.
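One possible shape for that suggestion is sketched below. `render_manifest` is a hypothetical helper and the staging directory is an assumption; this is not the code in the PR. It enables `pipefail`, writes to a temp file, and renames it over the destination only when the whole pipeline succeeds.

```shell
#!/usr/bin/env bash
set -o pipefail

# render_manifest DEST STAGE_DIR CMD [ARGS...]
# Run CMD, filter its output, and replace DEST only if the whole pipeline succeeded.
render_manifest() {
  local dest="$1" stage_dir="$2"; shift 2
  local tmp
  tmp="$(mktemp "$stage_dir/manifest.XXXXXX")" || return 1
  if "$@" | sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d' > "$tmp"; then
    mv "$tmp" "$dest"   # rename is atomic when STAGE_DIR is on the same filesystem
  else
    rm -f "$tmp"        # leave the existing manifest untouched on failure
    return 1
  fi
}

# Intended use (paths from the command under review; the staging directory is an assumption):
# render_manifest /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes \
#   kubectl kustomize /etc/kubernetes/keystone-kustomization
```

The temp file is staged outside the static-pod directory on purpose, since kubelet reads manifests from that directory and a half-written file there could be picked up.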
Co-authored-by: ricolin <7250045+ricolin@users.noreply.github.com>
Reviewed and validated locally (pre-commit + cargo test). Pushed a small follow-up to clarify keystone_auth comments and the healthz wait command: ff39765.
Mohammed Naser (mnaser) left a comment
If we're going to touch this feature, I'd like us to leverage this:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#patches
Use kubeadm's native patch directory for the kube-apiserver static pod changes instead of rewriting the rendered manifest in post-kubeadm commands. This removes the kustomize pipeline that could mask failures and truncate the live manifest.

Assisted-By: ChatGPT <noreply@openai.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
recheck
The webhook config points kube-apiserver at 127.0.0.1:8443, so the k8s-keystone-auth DaemonSet must run on each control-plane node. Newer kubeadm clusters label control planes with node-role.kubernetes.io/control-plane instead of the legacy master label, so the old nodeSelector could leave no webhook listener on the API server host.

Use node affinity to accept either control-plane or master labels and add TCP probes for the HTTPS listener so readiness reflects the local webhook endpoint.

Assisted-By: Codex <noreply@openai.com>
Signed-off-by: Rico Lin <rlin@vexxhost.com>
890d245 to c727b88
Summary
Fixes `enable_keystone_auth=true` on CAPI-driven Magnum clusters. Without this PR, clusters that opt in to Keystone webhook authorization either silently bypass the webhook (so Keystone tokens never authorize) or fail to bootstrap entirely because the kube-apiserver static pod crash-loops at startup.

The change is a single commit touching `src/features/keystone_auth.rs`:
* Emit `--authorization-mode=Node,RBAC,Webhook` via a JSON-Patch op against the kustomize input.
* Strip the kubeadm-default `--authorization-mode=Node,RBAC` line from the kustomize output before writing the manifest back, so kubelet only ever sees a single `--authorization-mode` line.
* Wait for the `k8s-keystone-auth` Pod to become Ready before declaring post-kubeadm complete.

Background and analysis

1. Authorization mode never reached the apiserver

The pre-existing `keystone_auth` feature kustomized `/etc/kubernetes/manifests/kube-apiserver.yaml` to add the keystone webhook authorizer, but did not adjust `--authorization-mode`. The running apiserver therefore had only the kubeadm-default `Node,RBAC` modes, never consulted the webhook, and every Keystone-token authenticated request failed with `RBAC: forbidden`.

2. Duplicate `--authorization-mode` line crashed the apiserver

kubeadm always writes `--authorization-mode=Node,RBAC` into its apiserver static-pod manifest. The feature's kustomization runs after kubeadm, so simply appending `--authorization-mode=Node,RBAC,Webhook` would leave the manifest carrying both lines:

    - --authorization-mode=Node,RBAC
    - --authorization-mode=Node,RBAC,Webhook

`--authorization-mode` is a `pflag.StringSliceVar`, which appends repeated flag values rather than letting the last occurrence win. The apiserver process therefore receives `["Node","RBAC","Node","RBAC","Webhook"]` and bails at startup with:

    authorization-mode ["Node" "RBAC" "Node" "RBAC" "Webhook"] has mode specified more than once

`kube-apiserver` enters CrashLoopBackOff, KCP never reaches `APIServerAvailable=true`, and the cluster fails NodeStartupTimeout. This is the failure mode every `enable_keystone_auth=true` cluster hits without this PR.

The fix here pipes the kustomize output through `sed '/^[[:space:]]*- --authorization-mode=Node,RBAC$/d'` before writing the manifest, so the kubeadm-default line is removed and only the `Node,RBAC,Webhook` line remains.

3. Cluster bootstrap raced ahead of the keystone webhook deployment

The post-kubeadm command sequence proceeded as soon as the apiserver TCP port was open, but the `k8s-keystone-auth` Deployment was still rolling out. Webhook calls during that window returned `connection refused`, which the apiserver caches negatively and surfaces as authentication failures for several seconds after the webhook is actually up.

Fix: append `kubectl wait --for=condition=Ready` against the `k8s-keystone-auth` Pod to the post-kubeadm commands so the cluster does not register Ready until the webhook actually serves requests.

Validation

End-to-end on a CAPI baremetal cluster created with `enable_keystone_auth=true`:
* KubeadmControlPlane reports `APIServerAvailable=true`.
* Cluster reaches `CREATE_COMPLETE` cleanly.
* Keystone-token authenticated `kubectl` requests succeed.

Two unit tests in `src/features/keystone_auth.rs` cover the patch emission and the post-kubeadm command pipeline; both pass with `cargo test`.

Notes
* The `sed` filter is anchored on the whole line, tolerates leading whitespace, and only matches the literal `--authorization-mode=Node,RBAC` value; it cannot accidentally remove a future `Node,RBAC,Foo` value or any line that merely contains the substring.
* … `enable_keystone_auth=true` label.