Commit 07ea2d0

Update nvidia-device-plugin-app
1 parent e47cf6b commit 07ea2d0

2 files changed

Lines changed: 56 additions & 7 deletions

docs/SETUP.md

Lines changed: 55 additions & 1 deletion
@@ -1097,4 +1097,58 @@ As an aside, you can also add an alias on the node in question to shut it down w
 
 ```bash
 alias shutdown="systemctl poweroff"
-```
+```
+
+Once you're done with all of the above, you'll need to go into your BIOS and ensure the following settings are setup:
+
+1. ErP is disabled
+2. Wake on PCIe is enabled
+
+#### GPU Node Setup
+
+Follow Nvidia's setup steps here for `nvidia-container-toolkit`:
+
+https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
+
+
+
+TODO
+
+Test with `containerd` that the Nvidia GPU is available:
+
+```bash
+sudo ctr image pull docker.io/nvidia/cuda:12.3.2-base-ubuntu22.04
+
+sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.3.2-base-ubuntu22.04 cuda-12.3.2-base-ubuntu22.04 nvidia-smi
+```
+
+If there are any issues with `containerd`, make sure that the following is set up for the `/etc/containerd/config.d/99-nvidia.toml` file:
+
+```toml
+
+```
+
+#### Helpful Commands
+
+````bash
+version = 2
+
+[plugins]
+
+[plugins."io.containerd.grpc.v1.cri"]
+
+[plugins."io.containerd.grpc.v1.cri".containerd]
+
+[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
+
+[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
+runtime_type = "io.containerd.runc.v2"
+
+[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+privileged_without_host_devices = false
+runtime_root = ""
+runtime_type = "io.containerd.runc.v2"
+
+[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+BinaryName = "/usr/bin/nvidia-container-runtime"
+````

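The SETUP.md change asks the reader to make sure `/etc/containerd/config.d/99-nvidia.toml` declares the `nvidia` runtime. A minimal sketch of how that check could be scripted, assuming the table name and `BinaryName` value shown in the diff; the function name `check_nvidia_runtime` is hypothetical and not part of the repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): checks that a containerd
# config file declares an "nvidia" runtime table and that its BinaryName
# points at nvidia-container-runtime.
check_nvidia_runtime() {
  # Default path is the one named in docs/SETUP.md.
  local config="${1:-/etc/containerd/config.d/99-nvidia.toml}"

  if [ ! -r "$config" ]; then
    echo "missing: $config"
    return 1
  fi

  # Matches the runtime table header itself, not the ".options" sub-table.
  if ! grep -qF 'containerd.runtimes.nvidia]' "$config"; then
    echo "no nvidia runtime table in $config"
    return 1
  fi

  # The options table should name the NVIDIA runtime binary.
  if ! grep -qE '^[[:space:]]*BinaryName[[:space:]]*=[[:space:]]*"[^"]*nvidia-container-runtime"' "$config"; then
    echo "BinaryName does not point at nvidia-container-runtime"
    return 1
  fi

  echo "ok: nvidia runtime configured in $config"
}
```

Called with no argument, `check_nvidia_runtime` reads the path from SETUP.md; pass another path to check a different file. It only greps for the expected lines and does not parse the TOML, so it will not catch structural mistakes.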
helm/nvidia-device-plugin/values.yaml

Lines changed: 1 addition & 6 deletions
@@ -4,7 +4,6 @@ image:
   pullPolicy: IfNotPresent
 
 nodeSelector:
-  # Target GPU nodes specifically
   kubernetes.io/arch: amd64
   gpu: "true"
 
@@ -16,29 +15,25 @@ resources:
     memory: 64Mi
     cpu: 50m
 
-# Security context for the device plugin
 securityContext:
   privileged: true
 
-# Tolerations for GPU nodes
 tolerations:
   - key: gpu
     operator: Equal
     value: "true"
     effect: NoExecute
 
-# Environment variables for nvidia-device-plugin
 env:
   - name: DEVICE_LIST_STRATEGY
-    value: "envvar" # Use envvar for device list strategy
+    value: "envvar"
   - name: DEVICE_ID_STRATEGY
     value: "uuid"
   - name: NVIDIA_VISIBLE_DEVICES
     value: "all"
   - name: NVIDIA_DRIVER_CAPABILITIES
     value: "compute,utility"
 
-# Mount the NVIDIA libraries and device files
 volumeMounts:
   - name: device-plugin
     mountPath: /var/lib/kubelet/device-plugins

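The `nodeSelector` and `tolerations` in `values.yaml` only take effect if a node carries a matching label, and the toleration only matters if the node is tainted. A minimal sketch of preparing such a node, assuming GPU nodes are meant to have a `gpu=true` label and a `gpu=true:NoExecute` taint (the diff implies this but does not show it); the function name `prepare_gpu_node` and the `APPLY` switch are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): labels and taints a node so
# that it matches the nodeSelector and tolerations in values.yaml.
# By default it only prints the kubectl commands; set APPLY=1 to run them.
prepare_gpu_node() {
  local node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: prepare_gpu_node <node-name>" >&2
    return 2
  fi

  # Dry-run by default: prefix each command with "echo".
  local run="echo"
  if [ "${APPLY:-0}" = "1" ]; then
    run=""
  fi

  # Matches nodeSelector -> gpu: "true"
  # (kubernetes.io/arch is set by the kubelet and needs no manual label.)
  $run kubectl label nodes "$node" gpu=true --overwrite

  # Matches the toleration: key gpu, value "true", effect NoExecute.
  $run kubectl taint nodes "$node" gpu=true:NoExecute --overwrite
}
```

Running `prepare_gpu_node my-gpu-node` prints the two commands for review. Note that a `NoExecute` taint evicts pods already running on the node that do not tolerate it, so apply it with care.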