Original context and journalctl logs here: containerd/containerd#10437
As we know, by default nvidia-container-toolkit sends a SIGHUP to containerd so that the patched containerd config takes effect. Unfortunately, because gpu-operator schedules its DaemonSets all at once, we have noticed that our GPU discovery and nvidia-device-plugin pods get stuck in Pending forever. This is primarily because the config-manager-init container gets stuck in the Created state and never transitions to Running due to the containerd restart.
Timeline of race condition:
- nvidia-container-toolkit and nvidia-device-plugin pods are scheduled
- nvidia-device-plugin waits on toolkit-ready file validation via an init container
- The toolkit patches the containerd config to register the nvidia runtime
- The toolkit sends SIGHUP to containerd and writes the toolkit-ready file
- The config-manager-init container from the nvidia-device-plugin pod enters the Created state
- containerd restarts
- config-manager-init is stuck in Created forever, hence the device plugin never gets to start
Today the only way for us to recover is to manually delete the stuck DaemonSet pods.
While I understand that at its core this is a containerd issue, it has become so troublesome that we are looking into entrypoint and node-label hacks. We are willing to take a solution that allows us to modify the entrypoint ConfigMaps of DaemonSets managed by ClusterPolicy.
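For context, the kind of entrypoint hack we have in mind is a thin wrapper that delays the init container's real work until containerd is responsive again after the SIGHUP-triggered restart. A minimal sketch, where the `ctr` readiness probe, the socket path, and the `config-manager` entrypoint path are all assumptions for illustration, not actual gpu-operator internals:

```shell
#!/bin/sh
# Hypothetical entrypoint wrapper sketch.
# wait_for MAX_TRIES CMD...: retry CMD once per second until it
# succeeds or MAX_TRIES attempts have been made.
wait_for() {
  max="$1"; shift
  tries=0
  until "$@"; do
    tries=$((tries + 1))
    [ "$tries" -ge "$max" ] && return 1
    sleep 1
  done
  return 0
}

# In the wrapper itself we would gate the original entrypoint on
# containerd answering again (paths below are assumptions):
#
#   wait_for 60 ctr --address /run/containerd/containerd.sock version \
#     || exit 1
#   exec /usr/bin/config-manager "$@"
```

This only papers over the race from the pod side; the container stuck in Created would still need containerd itself to recover the task, which is why we consider it a hack rather than a fix.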
I think something similar was discovered here, but with a different effect:
963b8dc
and it was fixed with a sleep.
P.S. I am aware container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.
cc: @klueska