The template below is mostly useful for bug reports. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version: Amazon Linux 2023 (EKS optimised AMI)
- Kernel Version: 6.12.66-88.122.amzn2023.x86_64
- Container Runtime Type/Version: containerd 2.1.5
- K8s Flavor/Version: EKS 1.35
- Device Plugin Version: v0.18.2 (worked correctly on v0.18.0)
2. Issue or feature description
Expected behavior: On node provisioning, the device plugin registers the correct nvidia.com/gpu allocatable count, and the count remains stable.
Current behavior: After upgrading from v0.18.0 to v0.18.2, newly provisioned GPU nodes consistently register allocatable.nvidia.com/gpu = 0. Restarting the nvidia-device-plugin pod on an affected node immediately restores the correct GPU count. This does not occur on v0.18.0.
Setup:
- Multi-profile ConfigMap (default, gpu-timesliced-2/3/4); profile selected via the nvidia.com/device-plugin.config node label
- Nodes provisioned by Karpenter (g5 instances, NVIDIA A10G)
- Label set in NodePool.spec.template.metadata.labels (applied by Karpenter after kubelet registers the Node)
- Driver preinstalled on the AMI (GPU Operator not used)
Scale of impact:
kubectl get nodes -l karpenter.sh/nodepool=gpu-timesliced-4 \
-o custom-columns="NAME:.metadata.name,CONFIG:.metadata.labels.nvidia\.com/device-plugin\.config,GPU:.status.allocatable.nvidia\.com/gpu"
NAME CONFIG GPU
ip-10-138-101-229.eu-west-1.compute.internal gpu-timesliced-4 4
ip-10-138-102-218.eu-west-1.compute.internal gpu-timesliced-4 4
ip-10-138-103-237.eu-west-1.compute.internal gpu-timesliced-4 0 ← broken
3. Information to attach
Commit 09d5135 ("Fix race condition in config-manager when label is unset") fixed a bug where the config-manager could hang indefinitely on startup. However, the fix changed the startup behavior:
v0.18.0: A race condition could cause the first Get() call to block forever, which as a side effect meant the empty label was never processed.
v0.18.2: The hang is fixed, but now the sidecar immediately processes the first label value it sees, including an empty string from the informer's initial list.
This exposed a new problem: the label is declared in NodePool.spec.template.metadata.labels, but Karpenter applies template labels to the Node object after kubelet's initial registration. The kubelet is unaware of Karpenter-managed labels, so there is a window during which the Node exists without the label. During this window, the informer delivers an empty label value, which the sidecar now acts on immediately, triggering a spurious SIGHUP with the wrong config.
Logs:
Init container:
I0305 21:29:47.155 Waiting for change to 'nvidia.com/device-plugin.config' label
I0305 21:29:47.155 Label change detected: nvidia.com/device-plugin.config=
I0305 21:29:47.156 No value set and no default set. Attempting fallback strategies: [named single]
I0305 21:29:47.156 Attempting to find config named: default
I0305 21:29:47.156 Updating to config: default
I0305 21:29:47.156 Successfully updated to config: default
Sidecar container:
I0305 21:29:51.605 Waiting for change to 'nvidia.com/device-plugin.config' label
I0305 21:29:51.605 Label change detected: nvidia.com/device-plugin.config=
I0305 21:29:51.605 No value set and no default set. Attempting fallback strategies: [named single]
I0305 21:29:51.605 Attempting to find config named: default
I0305 21:29:51.605 Updating to config: default
I0305 21:29:51.605 Successfully updated to config: default
I0305 21:29:51.605 Sending signal 'hangup' to 'nvidia-device-plugin' ← SIGHUP #1
I0305 21:29:51.606 Successfully sent signal
I0305 21:29:51.606 Waiting for change to 'nvidia.com/device-plugin.config' label
I0305 21:29:51.612 Label change detected: nvidia.com/device-plugin.config=gpu-timesliced-4
I0305 21:29:51.612 Updating to config: gpu-timesliced-4
I0305 21:29:51.612 Successfully updated to config: gpu-timesliced-4
I0305 21:29:51.612 Sending signal 'hangup' to 'nvidia-device-plugin' ← SIGHUP #2
I0305 21:29:51.612 Successfully sent signal
Plugin container (nvidia-device-plugin-ctr):
I0305 21:29:51.613 Running with config: timeSlicing: { replicas: 4 }
I0305 21:29:51.614 Starting GRPC server for 'nvidia.com/gpu'
I0305 21:29:51.614 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0305 21:29:51.616 Registered device plugin for 'nvidia.com/gpu' with Kubelet
Despite the final config being correct (replicas: 4), allocatable remains 0 until the pod is manually restarted.
nvidia-smi output:
Driver Version : 580.126.09
CUDA Version : 13.0
Attached GPUs : 1
Product Name : NVIDIA A10G
Architecture : Ampere
Sequence of Events:
1. Node initialised without the label: kubelet registers the Node with the API server; Karpenter-managed labels are not yet present.
2. Sidecar sees an empty label: the informer lists the Node for the first time and finds the config label missing; the sidecar treats this as an empty value.
3. Sidecar applies the default config: since no label is set, the sidecar falls back to the default config profile, updates the config file, and sends SIGHUP #1. The plugin restarts with timeSlicing: {} (no replicas).
4. Real label arrives ~7ms later: Karpenter applies nvidia.com/device-plugin.config=gpu-timesliced-4 to the Node.
5. Sidecar applies the correct config: the informer detects the label change, the sidecar updates the config file to gpu-timesliced-4, and sends SIGHUP #2. The plugin restarts with timeSlicing: { replicas: 4 }.
6. GPU count stuck at zero: despite ending up on the correct config, the two restarts within ~7ms cause the plugin's registration with kubelet to fail silently. Kubelet reports allocatable nvidia.com/gpu = 0 until the pod is manually restarted.
Questions:
Should the config-manager sidecar wait for the informer cache to sync before acting on the first label value, to avoid processing stale/incomplete state?
Is there a recommended way to ensure the config label is present on the Node before the sidecar starts processing?
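Regarding the second question: one workaround we are considering is setting the label at kubelet registration time via --node-labels, so it is present on the Node before the sidecar ever lists it. A sketch assuming the AL2023 nodeadm NodeConfig syntax in EC2NodeClass user data (untested; note the label value would then be fixed per EC2NodeClass rather than selected per NodePool):

```yaml
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  kubelet:
    flags:
      # Label applied by kubelet at registration, closing the window in
      # which the Node exists without the config label.
      - --node-labels=nvidia.com/device-plugin.config=gpu-timesliced-4
```

Confirmation of whether this (or something else) is the recommended approach would be appreciated.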