
[Bug]: Allocatable nvidia.com/gpu=0 after v0.18.2 upgrade due to config-manager race on startup #1645

@ElsevierAlex

Description



1. Quick Debug Information

  • OS/Version: Amazon Linux 2023 (EKS optimised AMI)
  • Kernel Version: 6.12.66-88.122.amzn2023.x86_64
  • Container Runtime Type/Version: containerd 2.1.5
  • K8s Flavor/Version: EKS 1.35
  • Device Plugin Version: v0.18.2 (worked correctly on v0.18.0)

2. Issue or feature description

Expected behavior: On node provisioning, the device plugin registers the correct nvidia.com/gpu allocatable count and maintains it stably.

Current behavior: After upgrading from v0.18.0 to v0.18.2, newly provisioned GPU nodes consistently register allocatable.nvidia.com/gpu = 0. Restarting the nvidia-device-plugin pod on an affected node immediately restores the correct GPU count. This does not occur on v0.18.0.

Setup:

  • Multi-profile ConfigMap (default, gpu-timesliced-2/3/4)
  • Profile selected via node label; nodes provisioned by Karpenter (g5 instances, NVIDIA A10G)
  • Label set in NodePool.spec.template.metadata.labels (applied by Karpenter after the kubelet registers the Node)
  • Driver preinstalled on the AMI (GPU Operator not used)

Scale of impact:

kubectl get nodes -l karpenter.sh/nodepool=gpu-timesliced-4 \
  -o custom-columns="NAME:.metadata.name,CONFIG:.metadata.labels.nvidia\.com/device-plugin\.config,GPU:.status.allocatable.nvidia\.com/gpu"

NAME                                           CONFIG             GPU
ip-10-138-101-229.eu-west-1.compute.internal   gpu-timesliced-4   4
ip-10-138-102-218.eu-west-1.compute.internal   gpu-timesliced-4   4
ip-10-138-103-237.eu-west-1.compute.internal   gpu-timesliced-4   0  ← broken

3. Information to attach

Commit 09d5135 ("Fix race condition in config-manager when label is unset") fixed a bug where the config-manager could hang indefinitely on startup. However, the fix changed the startup behavior:

  • v0.18.0: a race condition could cause the first Get() call to block forever, which as a side effect meant the empty label was never processed.
  • v0.18.2: the hang is fixed, but the sidecar now immediately processes the first label value it sees, including an empty string from the informer's initial list.

This exposed a new problem: the label is declared in NodePool.spec.template.metadata.labels, but Karpenter applies template labels to the Node object only after the kubelet's initial registration. Since the kubelet knows nothing about Karpenter-managed labels, there is a window in which the Node exists without the label. During this window the informer delivers an empty label value, which the sidecar now acts on immediately, triggering a spurious SIGHUP with the wrong config.

Logs:
Init container:

I0305 21:29:47.155  Waiting for change to 'nvidia.com/device-plugin.config' label
I0305 21:29:47.155  Label change detected: nvidia.com/device-plugin.config=
I0305 21:29:47.156  No value set and no default set. Attempting fallback strategies: [named single]
I0305 21:29:47.156  Attempting to find config named: default
I0305 21:29:47.156  Updating to config: default
I0305 21:29:47.156  Successfully updated to config: default

Sidecar container:

I0305 21:29:51.605  Waiting for change to 'nvidia.com/device-plugin.config' label
I0305 21:29:51.605  Label change detected: nvidia.com/device-plugin.config=
I0305 21:29:51.605  No value set and no default set. Attempting fallback strategies: [named single]
I0305 21:29:51.605  Attempting to find config named: default
I0305 21:29:51.605  Updating to config: default
I0305 21:29:51.605  Successfully updated to config: default
I0305 21:29:51.605  Sending signal 'hangup' to 'nvidia-device-plugin'    ← SIGHUP #1
I0305 21:29:51.606  Successfully sent signal
I0305 21:29:51.606  Waiting for change to 'nvidia.com/device-plugin.config' label
I0305 21:29:51.612  Label change detected: nvidia.com/device-plugin.config=gpu-timesliced-4
I0305 21:29:51.612  Updating to config: gpu-timesliced-4
I0305 21:29:51.612  Successfully updated to config: gpu-timesliced-4
I0305 21:29:51.612  Sending signal 'hangup' to 'nvidia-device-plugin'    ← SIGHUP #2
I0305 21:29:51.612  Successfully sent signal

Plugin container (nvidia-device-plugin-ctr):

I0305 21:29:51.613  Running with config: timeSlicing: { replicas: 4 }
I0305 21:29:51.614  Starting GRPC server for 'nvidia.com/gpu'
I0305 21:29:51.614  Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0305 21:29:51.616  Registered device plugin for 'nvidia.com/gpu' with Kubelet

Despite the final config being correct (replicas: 4), allocatable remains 0 until the pod is manually restarted.

nvidia-smi output:

Driver Version  : 580.126.09
CUDA Version    : 13.0
Attached GPUs   : 1
Product Name    : NVIDIA A10G
Architecture    : Ampere

Sequence of Events:

1. Node registered without the label: the kubelet registers the Node with the API server; Karpenter-managed labels are not yet present.
2. Sidecar sees an empty label: the informer lists the Node for the first time and finds the config label missing; the sidecar treats this as an empty value.
3. Sidecar applies the default config: since no label is set, the sidecar falls back to the default profile, updates the config file, and sends SIGHUP to the plugin. The plugin restarts with timeSlicing: {} (no replicas).
4. Real label arrives ~7 ms later: Karpenter applies nvidia.com/device-plugin.config=gpu-timesliced-4 to the Node.
5. Sidecar applies the correct config: the informer detects the label change, the sidecar updates the config file to gpu-timesliced-4, and sends a second SIGHUP. The plugin restarts with timeSlicing: { replicas: 4 }.
6. GPU count stuck at zero: despite ending up on the correct config, the two restarts within ~7 ms cause the plugin's registration with the kubelet to fail silently. The kubelet reports allocatable nvidia.com/gpu = 0 until the pod is manually restarted.

Questions:

  • Should the config-manager sidecar wait for the informer cache to sync before acting on the first label value, to avoid processing stale/incomplete state?
  • Is there a recommended way to ensure the config label is present on the Node before the sidecar starts processing?
