
containerd restart from nvidia-container-toolkit causes other daemonsets to get stuck #991

@chiragjn

Description


Original context and journalctl logs are here: containerd/containerd#10437

By default, nvidia-container-toolkit sends a SIGHUP to containerd so that the patched containerd config takes effect. Unfortunately, because gpu-operator schedules its DaemonSets all at once, we have noticed that our GPU discovery and NVIDIA device plugin pods get stuck in Pending forever. This happens primarily because the config-manager-init container gets stuck in Created and never transitions to Running when containerd restarts.

Timeline of race condition:

  • nvidia-container-toolkit and nvidia-device-plugin pods are scheduled
  • nvidia-device-plugin waits on toolkit-ready file validation via an init container
  • nvidia-container-toolkit patches the containerd config to update the nvidia runtime
  • nvidia-container-toolkit sends SIGHUP to containerd and writes the toolkit-ready file
  • The config-manager-init container from the nvidia-device-plugin pod enters the Created state
  • containerd restarts
  • config-manager-init stays stuck in Created forever, so the device plugin never gets to start

Today the only way for us to recover is to manually delete the stuck daemonset pods.
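
To illustrate the manual recovery, here is a minimal sketch of how the stuck pods could be picked out of `kubectl get pods -o json` output before deleting them. It is not part of gpu-operator. The function names, the 5 minute threshold, and the rule "Pending pod whose config-manager-init status is neither running nor terminated" are assumptions for illustration; only the standard Kubernetes pod fields (`status.phase`, `status.initContainerStatuses`, `metadata.creationTimestamp`) are relied on.

```python
from datetime import datetime, timedelta, timezone

# Assumed defaults for illustration; adjust to your deployment.
STUCK_INIT_CONTAINER = "config-manager-init"
STUCK_AFTER = timedelta(minutes=5)


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as 2024-07-10T12:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stuck(pod, now, init_name=STUCK_INIT_CONTAINER, threshold=STUCK_AFTER):
    """Heuristic: an old Pending pod whose named init container never started."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return False
    created = parse_k8s_time(pod["metadata"]["creationTimestamp"])
    if now - created < threshold:
        # Too young to tell; it may still be starting normally.
        return False
    for container_status in status.get("initContainerStatuses", []):
        if container_status.get("name") != init_name:
            continue
        state = container_status.get("state", {})
        # The init container has a status but is neither running nor finished.
        return "running" not in state and "terminated" not in state
    return False


def stuck_pod_names(pod_list, now):
    """Return names of pods in a `kubectl get pods -o json` document that look stuck."""
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if is_stuck(pod, now)
    ]
```

The returned names could then be passed to `kubectl delete pod` in the namespace where the operator runs. Because this is only a heuristic, it is worth confirming on the node that the container is really in the Created state before deleting anything.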

While I understand that this is, at its core, a containerd issue, it has become so troublesome that we are looking at entrypoint and node label hacks. We are willing to accept a solution that allows us to modify the entrypoint ConfigMaps of the DaemonSets managed by ClusterPolicy.

I think something similar, but with a different effect, was discovered in 963b8dc and was fixed with a sleep.

P.S. I am aware that container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.

cc: @klueska

Labels: enhancement, needs-triage, question
