A Kubernetes controller that safely handles ESXi maintenance mode transitions
for worker nodes that use PCI passthrough (Intel ARC / NVIDIA / any
passthrough device). Without this controller, entering maintenance on a host
with a passthrough-attached VM hangs indefinitely: vMotion is blocked, and
HostSystem.inMaintenanceMode never flips to true.
The controller detects the operator clicking Enter Maintenance Mode in vCenter, drains the affected Kubernetes node, powers the VM off, and — if possible — migrates it (cold) to another GPU-capable host and brings it back online. When the original host exits maintenance, a powered-off node is returned to service automatically.
Image: ghcr.io/varashi/gpu-node-vsphere-maintenance-controller (public).
HostSystem.inMaintenanceMode only transitions to true once every VM on
the host is either migrated (vMotion) or powered off. PCI passthrough
disables vMotion. Result: maintenance mode hangs until an operator manually
powers off each passthrough VM. In a cluster with several GPU workers, that
manual dance is error-prone and blocks patching, firmware updates, and
hardware swaps.
This controller automates the full lifecycle:
- Detect the `enterMaintenanceMode` task the moment it fires.
- Cordon and drain the K8s node that maps to the VM on that host.
- Power off the VM (allowing the maintenance task to complete).
- Migrate the VM to a free GPU-capable host if one exists, or wait for the original host to leave maintenance.
- Power on, wait for Node Ready, uncordon.
State is persisted as Node annotations so a controller restart resumes cleanly.
The key trick: maintenance-mode intent is visible via
HostSystem.recentTask with info.name == "EnterMaintenanceMode_Task" and
state == "running". The task never completes for passthrough VMs, but it
appears the instant the operator triggers it. The controller polls for this
task (default every 30s).
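For illustration, the check itself is small with pyVmomi, assuming a connected session and an already-resolved `vim.HostSystem`; the helper name is hypothetical, not the controller's actual code:

```python
from pyVmomi import vim

def host_entering_maintenance(host: vim.HostSystem) -> bool:
    """True if an EnterMaintenanceMode_Task is currently running on this host."""
    for task in host.recentTask or []:
        info = task.info
        if info.name == "EnterMaintenanceMode_Task" and info.state == vim.TaskInfo.State.running:
            return True
    return False
```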
Nodes are discovered by label. The default is `intel.feature.node.kubernetes.io/gpu=true` (emitted by the Intel Device Plugin / NFD). Configurable via `GPU_NODE_LABEL`.
The VM name in vSphere must match the Kubernetes Node name exactly.
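Because the mapping is purely name-based, resolving a Node to its VM can be a simple inventory scan. A pyVmomi sketch, illustrative rather than the controller's actual lookup:

```python
from pyVmomi import vim

def find_vm_by_node_name(content, node_name: str):
    """Return the VirtualMachine whose vSphere name equals the Node name, or None."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next((vm for vm in view.view if vm.name == node_name), None)
    finally:
        view.DestroyView()
```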
Annotations on the Node (all prefixed `vsphere-maintenance.boeye.net/`):

| Key | Values |
|---|---|
| `state` | `draining`, `powered-off`, `migrated` |
| `host` | ESXi host that triggered maintenance |
| `migrated-to-host` | Host the VM moved to (when `state=migrated`) |
| `transition-time` | ISO8601 of last transition |
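For illustration, a state transition could be persisted with the official Kubernetes Python client roughly like this (annotation keys from the table above; the helper name and shape are illustrative, not the controller's actual code):

```python
from datetime import datetime, timezone
from kubernetes import client

PREFIX = "vsphere-maintenance.boeye.net/"

def record_transition(v1: client.CoreV1Api, node: str, state: str, host: str) -> None:
    """Patch the Node with the controller's state so a restart can resume from it."""
    patch = {"metadata": {"annotations": {
        PREFIX + "state": state,
        PREFIX + "host": host,
        PREFIX + "transition-time": datetime.now(timezone.utc).isoformat(),
    }}}
    v1.patch_node(node, patch)
```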
Flow on enter maintenance:

- `draining` — cordon, evict pods (retries every poll for PDBs; see the eviction sketch after this list).
- When drained (or `DRAIN_TIMEOUT_SECONDS` elapses) → power off VM → try to migrate:
  - DRS fully automated: call `PowerOn()` and let DRS pick a host.
  - No DRS: find a free GPU-capable host (`pciPassthruEnabled=true`, no existing GPU worker, not in maintenance), `RelocateVM` cold, then `PowerOn()`.
  - Success → `migrated`. Failure / no host → `powered-off`.
- `migrated` — poll until Node Ready → uncordon → clear annotations.
- `powered-off` — wait for original host to leave maintenance → `PowerOn()` → wait for Ready → uncordon → clear annotations.
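The drain step goes through the eviction API rather than deleting pods, so PodDisruptionBudgets are honoured. A minimal sketch with the Kubernetes Python client (illustrative helper, not the controller's exact code):

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

def try_evict(v1: client.CoreV1Api, pod: client.V1Pod) -> bool:
    """Attempt one eviction; HTTP 429 means a PDB blocked it, so retry next poll."""
    body = client.V1Eviction(metadata=client.V1ObjectMeta(
        name=pod.metadata.name, namespace=pod.metadata.namespace))
    try:
        v1.create_namespaced_pod_eviction(pod.metadata.name, pod.metadata.namespace, body)
        return True
    except ApiException as e:
        if e.status == 429:
            return False
        raise
```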
Flow on exit maintenance: any Node in powered-off state referencing that
host is powered back on. migrated nodes are already running elsewhere and
are ignored.
Recovery: if a powered-off VM ends up on a different host (DRS race,
operator intervention), the controller notices on the next poll and
transitions it to migrated.
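In sketch form, assuming the VM object, its recorded original host, and the current annotation state are already in hand (illustrative, not the controller's actual reconciliation code):

```python
from pyVmomi import vim

def reconcile_state(vm: vim.VirtualMachine, original_host: str, state: str) -> str:
    """Advance 'powered-off' to 'migrated' if the VM is already running on another host."""
    if (state == "powered-off"
            and vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn
            and vm.runtime.host.name != original_host):
        return "migrated"
    return state
```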
- Kubernetes 1.26+ (eviction API, server-side apply)
- vSphere 7+ (tested on 8.0)
- Workers running as vSphere VMs, with a 1:1 `VM name == Node name` mapping
- vCenter user with `Virtual Machine → Power Off/On`, `Virtual Machine → Migrate`, `Host → Inventory → Read`, and task/view privileges
- GPU workers labelled so they can be discovered
The chart is published as an OCI artifact alongside the image:
helm upgrade --install gpu-node-vsphere-maintenance \
oci://ghcr.io/varashi/charts/gpu-node-vsphere-maintenance-controller \
--version 0.4.3 \
--namespace gpu-node-vsphere-maintenance --create-namespace \
--set vcenter.host=vcenter.example.com \
--set vcenter.user=maintenance-controller@vsphere.local \
--set vcenter.password='replace-me'

To pair with External Secrets Operator, render a Secret containing
VCENTER_HOST / VCENTER_USER / VCENTER_PASSWORD yourself and pass
--set vcenter.existingSecret=<name> instead. To enable TLS verification,
create a ConfigMap containing your CA bundle and pass
--set vcenter.caBundle.configMapName=<name>.
A Flux HelmRelease example:
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
name: gpu-node-vsphere-maintenance-controller
namespace: gpu-node-vsphere-maintenance
spec:
interval: 1h
url: oci://ghcr.io/varashi/charts/gpu-node-vsphere-maintenance-controller
ref:
tag: 0.4.3
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: gpu-node-vsphere-maintenance-controller
namespace: gpu-node-vsphere-maintenance
spec:
interval: 1h
chartRef:
kind: OCIRepository
name: gpu-node-vsphere-maintenance-controller
values:
vcenter:
existingSecret: vsphere-credentials
caBundle:
configMapName: vcenter-ca

See chart/values.yaml in this repo for the full value surface.
If you would rather skip Helm, the equivalent manifests (adjust namespace and credentials source as needed):
apiVersion: v1
kind: Namespace
metadata:
name: gpu-node-vsphere-maintenance
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: gpu-node-vsphere-maintenance
namespace: gpu-node-vsphere-maintenance
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: gpu-node-vsphere-maintenance
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch", "patch", "update"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["pods/eviction"]
verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: gpu-node-vsphere-maintenance
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: gpu-node-vsphere-maintenance
subjects:
- kind: ServiceAccount
name: gpu-node-vsphere-maintenance
namespace: gpu-node-vsphere-maintenance
---
apiVersion: v1
kind: Secret
metadata:
name: vsphere-credentials
namespace: gpu-node-vsphere-maintenance
type: Opaque
stringData:
VCENTER_HOST: vcenter.example.com
VCENTER_USER: maintenance-controller@vsphere.local
VCENTER_PASSWORD: replace-me
---
apiVersion: v1
kind: ConfigMap
metadata:
name: controller-config
namespace: gpu-node-vsphere-maintenance
data:
POLL_INTERVAL_SECONDS: "30"
DRAIN_TIMEOUT_SECONDS: "600"
GUEST_SHUTDOWN_TIMEOUT_SECONDS: "120"
POWER_ON_TIMEOUT_SECONDS: "300"
MAX_CONCURRENT_DRAINS: "1"
GPU_NODE_LABEL: "intel.feature.node.kubernetes.io/gpu=true"
DRY_RUN: "false"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-node-vsphere-maintenance
namespace: gpu-node-vsphere-maintenance
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: gpu-node-vsphere-maintenance
template:
metadata:
labels:
app: gpu-node-vsphere-maintenance
spec:
serviceAccountName: gpu-node-vsphere-maintenance
containers:
- name: controller
image: ghcr.io/varashi/gpu-node-vsphere-maintenance-controller:v0.3.0
envFrom:
- secretRef:
name: vsphere-credentials
- configMapRef:
name: controller-config
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
memory: 128Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 65532
capabilities:
drop: ["ALL"]
seccompProfile:
type: RuntimeDefault
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: intel.feature.node.kubernetes.io/gpu
operator: DoesNotExist

The node anti-affinity keeps the controller off GPU workers — otherwise it would drain itself.
If you use ESO, replace the plain Secret above with an ExternalSecret
pointing at your secret backend. An example against Bitwarden Secrets Manager:
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: vsphere-credentials
namespace: gpu-node-vsphere-maintenance
spec:
refreshInterval: 1h
secretStoreRef:
kind: ClusterSecretStore
name: bitwarden-secretsmanager
target:
name: vsphere-credentials
template:
type: Opaque
data:
VCENTER_HOST: "vcenter.example.com"
VCENTER_USER: "{{ .username }}"
VCENTER_PASSWORD: "{{ .password }}"
data:
- secretKey: username
remoteRef: { key: VSPHERE_USERNAME }
- secretKey: password
remoteRef: { key: VSPHERE_PASSWORD }

| Variable | Default | Description |
|---|---|---|
| `VCENTER_HOST` | (required) | vCenter FQDN or IP |
| `VCENTER_USER` | (required) | vCenter username |
| `VCENTER_PASSWORD` | (required) | vCenter password |
| `VCENTER_CA_BUNDLE` | (unset) | Path to CA bundle (PEM). When set, enables TLS verification against this CA. Covers self-signed and private/self-hosted CAs. |
| `VCENTER_TLS_VERIFY` | `false` | When `true` and `VCENTER_CA_BUNDLE` is unset, verify vCenter against the container's system trust store (publicly-signed certs). |
| `GPU_NODE_LABEL` | `intel.feature.node.kubernetes.io/gpu=true` | Node label selector (`key=value`) identifying GPU workers |
| `POLL_INTERVAL_SECONDS` | `30` | How often to poll vSphere for host state changes |
| `DRAIN_TIMEOUT_SECONDS` | `600` | Max time to wait for a drain to finish before forcing power-off |
| `GUEST_SHUTDOWN_TIMEOUT_SECONDS` | `120` | Max time to wait for guest OS shutdown (via VMware Tools) before hard power-off |
| `POWER_ON_TIMEOUT_SECONDS` | `300` | Max time to wait for a powered-on VM's Node to become Ready |
| `MAX_CONCURRENT_DRAINS` | `1` | Upper bound on simultaneous drain operations |
| `DRY_RUN` | `false` | If `true`, log actions without executing vSphere / Kubernetes mutations |
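All of these are plain environment variables; a minimal loader with the documented defaults could look like this (illustrative, not the controller's actual config code):

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to its documented default."""
    return int(os.environ.get(name, default))

POLL_INTERVAL_SECONDS = int_env("POLL_INTERVAL_SECONDS", 30)
DRAIN_TIMEOUT_SECONDS = int_env("DRAIN_TIMEOUT_SECONDS", 600)
GUEST_SHUTDOWN_TIMEOUT_SECONDS = int_env("GUEST_SHUTDOWN_TIMEOUT_SECONDS", 120)
POWER_ON_TIMEOUT_SECONDS = int_env("POWER_ON_TIMEOUT_SECONDS", 300)
MAX_CONCURRENT_DRAINS = int_env("MAX_CONCURRENT_DRAINS", 1)
GPU_NODE_LABEL = os.environ.get("GPU_NODE_LABEL", "intel.feature.node.kubernetes.io/gpu=true")
DRY_RUN = os.environ.get("DRY_RUN", "false").lower() == "true"
```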
Certificate verification against vCenter is opt-in. Default is unverified (homelab-style). Three supported modes:
- Self-signed certificate (default vCenter out of the box): export the certificate from vCenter, mount it as a ConfigMap into the pod, and set `VCENTER_CA_BUNDLE` to the mounted path (e.g. `/etc/ssl/vcenter-ca/ca.pem`). Use the chart's `vcenter.caBundle.configMapName` to wire this up.
- Private / self-hosted CA (e.g. an internal AD CS issuing the vCenter cert): identical to the self-signed case — mount the issuing CA's bundle and set `VCENTER_CA_BUNDLE` to that path. Intermediate certs concatenated into the same PEM work.
- Public CA (Let's Encrypt, DigiCert, etc.): set `VCENTER_TLS_VERIFY=true` and leave `VCENTER_CA_BUNDLE` unset. The controller uses Python's `ssl.create_default_context()` with no cafile, which falls back to OpenSSL's system trust store (the CA certs shipped in the container image). In the chart, set `vcenter.tlsVerify: true`.
VCENTER_CA_BUNDLE takes precedence over VCENTER_TLS_VERIFY when both
are set.
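Put together, the selection logic amounts to roughly the following sketch; the unverified branch uses `ssl._create_unverified_context()` as a stand-in and may differ from what the controller actually does:

```python
import os
import ssl

def vcenter_ssl_context() -> ssl.SSLContext:
    """Pick the TLS mode: explicit CA bundle > system trust store > unverified (default)."""
    ca_bundle = os.environ.get("VCENTER_CA_BUNDLE")
    verify = os.environ.get("VCENTER_TLS_VERIFY", "false").lower() == "true"
    if ca_bundle:
        return ssl.create_default_context(cafile=ca_bundle)
    if verify:
        return ssl.create_default_context()
    return ssl._create_unverified_context()
```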
docker build -t ghcr.io/you/gpu-node-vsphere-maintenance-controller:dev .
docker push ghcr.io/you/gpu-node-vsphere-maintenance-controller:dev

Source layout is deliberately tiny — a single controller.py plus a
minimal Python 3.13 Dockerfile. Dependencies: pyVmomi and the official
Kubernetes Python client.
- DRS vs. controller power-on: if DRS full-automation powers a VM on concurrently, the `PowerOn()` call returns an "already powered on" error. The controller verifies the VM is on a healthy (non-maintenance) host; if it landed on a maintenance host, it re-raises (sketched after this list).
- Concurrent power-off: symmetric to the power-on race. If a `PowerOff()` call lands on a VM that is already off, the controller treats the "Powered off" error as success.
- Stale `powered-off` annotation: if a VM is already running elsewhere when the controller inspects it, state is advanced to `migrated` without waiting for the original host to leave maintenance.
- Node never becomes Ready after power-on: `POWER_ON_TIMEOUT_SECONDS` bounds the wait; after that the controller logs and moves on. The Node stays cordoned until reconciled on a subsequent poll.
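The power-on race handling, in sketch form; it assumes the "already powered on" condition surfaces as `vim.fault.InvalidPowerState`, which may differ from the controller's actual error handling:

```python
from pyVim.task import WaitForTask
from pyVmomi import vim

def power_on_tolerating_race(vm: vim.VirtualMachine) -> None:
    """Treat a concurrent (e.g. DRS) power-on as success unless the VM landed on a maintenance host."""
    try:
        WaitForTask(vm.PowerOnVM_Task())
    except vim.fault.InvalidPowerState:
        if vm.runtime.host.runtime.inMaintenanceMode:
            raise
```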
- One VM per GPU host is assumed for migration target selection.
- Cluster-level PDBs can prevent draining; the controller does not force evictions.
- No leader election — run `replicas: 1` with `strategy.type: Recreate` (shown in the example Deployment). `Recreate` guarantees the old pod has fully terminated before the new one starts, avoiding any double-run race during rollouts. All operations are also idempotent against the state machine as a defence in depth.
- Only enter maintenance mode is detected via `recentTask`; exit is detected by watching `inMaintenanceMode` transition from `true` back to `false`, so a missed task during a controller restart is picked up on the next poll (a minimal edge-detection sketch follows this list).
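The exit-side detection is plain edge-triggering on the polled property; a sketch (how the previous value is tracked between polls is up to the controller):

```python
from pyVmomi import vim

def host_exited_maintenance(was_in_maintenance: bool, host: vim.HostSystem) -> bool:
    """True on the poll where inMaintenanceMode flips back from True to False."""
    return was_in_maintenance and not host.runtime.inMaintenanceMode
```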
Releases are cut by pushing a SemVer tag vX.Y.Z to main:
git tag v0.3.1
git push origin v0.3.1

The release.yaml GitHub Actions workflow then:

- Builds and pushes the controller image to `ghcr.io/varashi/gpu-node-vsphere-maintenance-controller`, multi-arch (`linux/amd64`, `linux/arm64`), with cosign keyless signatures (GitHub OIDC), an SPDX SBOM, and a build-provenance attestation.
- Packages the Helm chart in `chart/` with `version` and `appVersion` matching the tag and pushes it to `oci://ghcr.io/varashi/charts/gpu-node-vsphere-maintenance-controller`.
- Creates a GitHub Release whose body is extracted from the matching section of `CHANGELOG.md` and attaches the SBOM and the packaged chart `.tgz`.
Every release is cosign-keyless-signed (GitHub OIDC), carries a SLSA build provenance attestation pushed to the registry, and has the SPDX SBOM attached as a cosign attestation. Verify any of these before deploying:
# 1. Image signature.
cosign verify \
--certificate-identity-regexp 'https://github\.com/Varashi/gpu-node-vsphere-maintenance-controller/\.github/workflows/release\.yaml@refs/tags/v.*' \
--certificate-oidc-issuer https://token.actions.githubusercontent.com \
ghcr.io/varashi/gpu-node-vsphere-maintenance-controller:<tag>
# 2. SBOM attestation (SPDX).
cosign verify-attestation --type spdxjson \
--certificate-identity-regexp 'https://github\.com/Varashi/gpu-node-vsphere-maintenance-controller/\.github/workflows/release\.yaml@refs/tags/v.*' \
--certificate-oidc-issuer https://token.actions.githubusercontent.com \
ghcr.io/varashi/gpu-node-vsphere-maintenance-controller:<tag>
# 3. SLSA build provenance (GitHub Attestations).
gh attestation verify \
oci://ghcr.io/varashi/gpu-node-vsphere-maintenance-controller:<tag> \
--owner Varashi

Verify the Helm chart the same way — the release workflow cosign-signs chart digests too:
cosign verify \
--certificate-identity-regexp 'https://github\.com/Varashi/gpu-node-vsphere-maintenance-controller/\.github/workflows/release\.yaml@refs/tags/v.*' \
--certificate-oidc-issuer https://token.actions.githubusercontent.com \
ghcr.io/varashi/charts/gpu-node-vsphere-maintenance-controller:<tag>

helm pull --verify is not supported against this chart: --verify
looks for a PGP .prov file (produced by helm package --sign), which
is a separate signing mechanism from cosign keyless. Use cosign verify
above instead.
See CHANGELOG.md for the full history. Released tags
are also listed on the GitHub Releases
page with signed assets and SBOMs.
MIT. See LICENSE (add one if distributing — public use is welcome).