Original context and journalctl logs here: containerd/containerd#10437
As we know, by default nvidia-container-toolkit sends a SIGHUP to containerd so that the patched containerd config takes effect. Unfortunately, because gpu-operator schedules its DaemonSets all at once, we have noticed that our GPU discovery and nvidia-device-plugin pods get stuck in Pending forever. This is primarily because the config-manager-init container gets stuck in the Created state and never transitions to Running due to the containerd restart.
Timeline of race condition:
- nvidia-container-toolkit and nvidia-device-plugin pods are scheduled
- nvidia-device-plugin waits on toolkit-ready file validation via an init container
- The toolkit patches the containerd config to register the nvidia runtime
- The toolkit sends SIGHUP to containerd and writes the toolkit-ready file
- The config-manager-init container from the nvidia-device-plugin pod enters the Created state
- containerd restarts
- config-manager-init is stuck in Created forever, hence the device plugin never gets to start
Today the only way for us to recover is to manually delete the stuck DaemonSet pods.
While I understand that at its core this is a containerd issue, it has become so troublesome that we are looking into entrypoint and node-label hacks. We are willing to take a solution that allows us to modify the entrypoint ConfigMaps of DaemonSets managed by ClusterPolicy.
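For context, the kind of entrypoint hack we have in mind is a thin wrapper that delays the init container's real work until containerd is responsive again after the SIGHUP-triggered restart. A minimal sketch, where the `ctr` readiness probe, the socket path, and the `config-manager` entrypoint path are all assumptions for illustration, not actual gpu-operator internals:

```shell
#!/bin/sh
# Hypothetical entrypoint wrapper sketch.
# wait_for MAX_TRIES CMD...: retry CMD once per second until it
# succeeds or MAX_TRIES attempts have been made.
wait_for() {
  max="$1"; shift
  tries=0
  until "$@"; do
    tries=$((tries + 1))
    [ "$tries" -ge "$max" ] && return 1
    sleep 1
  done
  return 0
}

# In the wrapper itself we would gate the original entrypoint on
# containerd answering again (paths below are assumptions):
#
#   wait_for 60 ctr --address /run/containerd/containerd.sock version \
#     || exit 1
#   exec /usr/bin/config-manager "$@"
```

This only papers over the race from the pod side; the container stuck in Created would still need containerd itself to recover the task, which is why we consider it a hack rather than a fix.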
I think something similar was discovered here, but with a different effect:
963b8dc
and it was fixed with a sleep.
P.S. I am aware container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.
cc: @klueska