The template below is mostly useful for bug reports. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-122-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.27.7
2. Issue or feature description
Summary
A critical cascading failure was observed where the `nvidia-device-plugin` (v0.18.2) enters a "zombie" state after a partial NVML initialization failure. While the plugin successfully registers with the Kubelet, subsequent internal NVML calls fail. This leads to a high-frequency Pod creation loop that eventually causes `kube-apiserver` and `etcd` to crash due to OOM.
Expected Behavior
If NVML fails to initialize during any stage of the plugin lifecycle (especially during Start() or GetPreferredAllocation), the plugin should:
- Mark the node's GPU resources as unhealthy (capacity = 0) OR
- Terminate itself to trigger a Container restart, preventing the scheduler from sending more Pods to a broken node.
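As a hedged illustration of the fail-fast option (the `nvmlLib` interface and `brokenNVML` stub below are hypothetical stand-ins for the real NVML binding, not the plugin's actual API), the desired behavior might look like:

```go
package main

import (
	"errors"
	"fmt"
)

// nvmlLib is a hypothetical stand-in for the real NVML binding, so the
// fail-fast logic can be shown without linking against libnvidia-ml.
type nvmlLib interface {
	Init() error
}

// brokenNVML simulates the partial-failure state from this report:
// the first Init() succeeds, every later one returns ERROR_UNKNOWN.
type brokenNVML struct{ calls int }

func (b *brokenNVML) Init() error {
	b.calls++
	if b.calls > 1 {
		return errors.New("ERROR_UNKNOWN")
	}
	return nil
}

// startHealthCheck fails fast: on NVML init failure it returns the error
// to the caller instead of logging it and continuing with health checks
// disabled, so the caller can zero out GPU capacity or exit for a restart.
func startHealthCheck(lib nvmlLib) error {
	if err := lib.Init(); err != nil {
		return fmt.Errorf("failed to initialize NVML: %w", err)
	}
	return nil
}

func main() {
	lib := &brokenNVML{}
	_ = lib.Init() // first init (plugin discovery) succeeds
	if err := startHealthCheck(lib); err != nil {
		fmt.Println("marking GPU capacity 0 and exiting:", err)
	}
}
```

Either of the two options works; the key is that the error must propagate to something that changes scheduling behavior, rather than being swallowed by a log line.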
Current Behavior (The Bug)
The plugin exhibits inconsistent NVML initialization states:
- `GetPlugins()` succeeds: The plugin registers GPUs to the node, making the node appear "Healthy" to the K8s scheduler.
- `p.Start()` (Health Check) fails: Logs `failed to initialize NVML: ERROR_UNKNOWN`, but the plugin continues running with health checks disabled.
- `GetPreferredAllocation` fails: When a Pod is scheduled, Kubelet calls this RPC. The plugin calls `alignedAlloc` -> `gpuallocator.NewDevices()` -> `nvml.Init()`. This fails with `ERROR_UNKNOWN`, returning an error to Kubelet.
- Cascading Failure: The Pod enters `UnexpectedAdmissionError` (Failed phase). The Controller-Manager immediately creates a replacement Pod, which is scheduled back to the same node, creating a tight loop that overwhelms the K8s control plane.
Root Cause Analysis (Source Code)
In `cmd/nvidia-device-plugin/main.go` and `internal/plugin/server.go`:
- Redundant Inits: NVML is initialized multiple times across different code paths. In this case, the first init in `GetPlugins()` worked, but subsequent inits in the `CheckHealth()` goroutine and `alignedAlloc()` failed.
- Lack of Fail-Fast: When `server.go:154` logs `Failed to start health check`, it does not stop the GRPC server or update the device status, leaving the node in a "false-positive" healthy state.
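One possible shape for a fix (a sketch under the assumption that all callers can share a single init result; `initNVML` below is a hypothetical stub, not the plugin's actual code): initialize NVML exactly once and return the cached result to every caller, so `GetPlugins()`, the health-check goroutine, and `alignedAlloc()` can never disagree about NVML state:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// initNVML is a hypothetical stub for nvml.Init(); it fails here to
// simulate the broken state described in this report.
func initNVML() error { return errors.New("ERROR_UNKNOWN") }

var (
	nvmlOnce sync.Once
	nvmlErr  error
)

// ensureNVML performs the real init at most once and hands the cached
// result to every code path, removing the redundant-init inconsistency.
func ensureNVML() error {
	nvmlOnce.Do(func() { nvmlErr = initNVML() })
	return nvmlErr
}

func main() {
	// All three code paths now observe the same, consistent failure,
	// so the plugin can fail fast instead of half-registering.
	for _, caller := range []string{"GetPlugins", "CheckHealth", "alignedAlloc"} {
		if err := ensureNVML(); err != nil {
			fmt.Printf("%s: NVML unavailable: %v\n", caller, err)
		}
	}
}
```

With a single shared result, the "first init worked, later inits failed" state described above becomes impossible, and the fail-fast decision can be made in one place.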
3. Information to attach (optional if deemed irrelevant)
K8s-device-plugin logs
```
I0213 09:51:50.965469 1 main.go:369] Retrieving plugins.
I0213 09:51:55.562661 1 server.go:197] Starting GRPC server for 'nvidia.com/gpu'
I0213 09:51:55.564144 1 server.go:141] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0213 09:51:55.565671 1 server.go:148] Registered device plugin for 'nvidia.com/gpu' with Kubelet
E0213 09:51:55.600000 1 server.go:154] Failed to start health check: failed to initialize NVML: ERROR_UNKNOWN; continuing with health checks disabled
```
Kubelet logs (Event showing Admission Error)
Pod Count Surge
Additional information that might help better understand your environment and reproduce the bug:
- `docker version`
- `uname -a`
- `dmesg`
- `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- `nvidia-ctk --version`