The template below is mostly useful for bug reports. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-122-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.27.7
2. Issue or feature description
Summary
A critical cascading failure was observed where the `nvidia-device-plugin` (v0.18.2) enters a "zombie" state after a partial NVML initialization failure. While the plugin successfully registers with the Kubelet, subsequent internal NVML calls fail. This leads to a high-frequency Pod creation loop that eventually causes `kube-apiserver` and `etcd` to crash due to OOM.
Expected Behavior
If NVML fails to initialize during any stage of the plugin lifecycle (especially during Start() or GetPreferredAllocation), the plugin should:
- Mark the node's GPU resources as unhealthy (capacity = 0) OR
- Terminate itself to trigger a Container restart, preventing the scheduler from sending more Pods to a broken node.
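As a hedged illustration of the fail-fast option (the `nvmlLib` interface and `brokenNVML` stub below are hypothetical stand-ins for the real NVML binding, not the plugin's actual API), the desired behavior might look like:

```go
package main

import (
	"errors"
	"fmt"
)

// nvmlLib is a hypothetical stand-in for the real NVML binding, so the
// fail-fast logic can be shown without linking against libnvidia-ml.
type nvmlLib interface {
	Init() error
}

// brokenNVML simulates the partial-failure state from this report:
// the first Init() succeeds, every later one returns ERROR_UNKNOWN.
type brokenNVML struct{ calls int }

func (b *brokenNVML) Init() error {
	b.calls++
	if b.calls > 1 {
		return errors.New("ERROR_UNKNOWN")
	}
	return nil
}

// startHealthCheck fails fast: on NVML init failure it returns the error
// to the caller instead of logging it and continuing with health checks
// disabled, so the caller can zero out GPU capacity or exit for a restart.
func startHealthCheck(lib nvmlLib) error {
	if err := lib.Init(); err != nil {
		return fmt.Errorf("failed to initialize NVML: %w", err)
	}
	return nil
}

func main() {
	lib := &brokenNVML{}
	_ = lib.Init() // first init (plugin discovery) succeeds
	if err := startHealthCheck(lib); err != nil {
		fmt.Println("marking GPU capacity 0 and exiting:", err)
	}
}
```

Either of the two options works; the key is that the error must propagate to something that changes scheduling behavior, rather than being swallowed by a log line.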
Current Behavior (The Bug)
The plugin exhibits inconsistent NVML initialization states:
- `GetPlugins()` succeeds: The plugin registers GPUs to the node, making the node appear "Healthy" to the K8s scheduler.
- `p.Start()` (Health Check) fails: Logs `failed to initialize NVML: ERROR_UNKNOWN`, but the plugin continues running with health checks disabled.
- `GetPreferredAllocation` fails: When a Pod is scheduled, Kubelet calls this RPC. The plugin calls `alignedAlloc` -> `gpuallocator.NewDevices()` -> `nvml.Init()`. This fails with `ERROR_UNKNOWN`, returning an error to Kubelet.
- Cascading Failure: The Pod enters `UnexpectedAdmissionError` (Failed phase). The Controller-Manager immediately creates a replacement Pod, which is scheduled back to the same node, creating a tight loop that overwhelms the K8s control plane.
Root Cause Analysis (Source Code)
In `cmd/nvidia-device-plugin/main.go` and `internal/plugin/server.go`:
- Redundant Inits: NVML is initialized multiple times across different code paths. In this case, the first init in `GetPlugins()` worked, but subsequent inits in the `CheckHealth()` goroutine and `alignedAlloc()` failed.
- Lack of Fail-Fast: When `server.go:154` logs `Failed to start health check`, it does not stop the GRPC server or update the device status, leaving the node in a "false-positive" healthy state.
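One possible shape for a fix (a sketch under the assumption that all callers can share a single init result; `initNVML` below is a hypothetical stub, not the plugin's actual code): initialize NVML exactly once and return the cached result to every caller, so `GetPlugins()`, the health-check goroutine, and `alignedAlloc()` can never disagree about NVML state:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// initNVML is a hypothetical stub for nvml.Init(); it fails here to
// simulate the broken state described in this report.
func initNVML() error { return errors.New("ERROR_UNKNOWN") }

var (
	nvmlOnce sync.Once
	nvmlErr  error
)

// ensureNVML performs the real init at most once and hands the cached
// result to every code path, removing the redundant-init inconsistency.
func ensureNVML() error {
	nvmlOnce.Do(func() { nvmlErr = initNVML() })
	return nvmlErr
}

func main() {
	// All three code paths now observe the same, consistent failure,
	// so the plugin can fail fast instead of half-registering.
	for _, caller := range []string{"GetPlugins", "CheckHealth", "alignedAlloc"} {
		if err := ensureNVML(); err != nil {
			fmt.Printf("%s: NVML unavailable: %v\n", caller, err)
		}
	}
}
```

With a single shared result, the "first init worked, later inits failed" state described above becomes impossible, and the fail-fast decision can be made in one place.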
3. Information to attach (optional if deemed irrelevant)
K8s-device-plugin logs
```
I0213 09:51:50.965469 1 main.go:369] Retrieving plugins.
I0213 09:51:55.562661 1 server.go:197] Starting GRPC server for 'nvidia.com/gpu'
I0213 09:51:55.564144 1 server.go:141] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0213 09:51:55.565671 1 server.go:148] Registered device plugin for 'nvidia.com/gpu' with Kubelet
E0213 09:51:55.600000 1 server.go:154] Failed to start health check: failed to initialize NVML: ERROR_UNKNOWN; continuing with health checks disabled
```
Kubelet logs (Event showing Admission Error)
Pod Count Surge
Additional information that might help better understand your environment and reproduce the bug:
- `docker version`
- `uname -a`
- `dmesg`
- `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- `nvidia-ctk --version`