Commands for installing NVIDIA Container and the K8S GPU plugin on CentOS 7
NVIDIA Docker dependencies
sudo yum install -y tar bzip2 make automake gcc gcc-c++ vim pciutils elfutils-libelf-devel libglvnd-devel iptables
### Set up the official Docker CE repository (yum-config-manager is provided by the yum-utils package):
sudo yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
### List the packages now available from the docker-ce repo:
sudo yum repolist -v
### Build the yum cache
sudo yum makecache
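The dependency list includes pciutils, which provides `lspci`. Before installing anything GPU-related it can be worth confirming that the host actually sees an NVIDIA device. The following is a minimal sketch; `nvidia_devices` is a hypothetical helper name, not part of any package, and it simply filters `lspci`-style lines read from standard input.

```shell
# Hypothetical helper: read lspci-style lines on stdin and print
# only those that mention NVIDIA (case-insensitive).
nvidia_devices() {
    # grep exits non-zero when nothing matches; "|| true" keeps the
    # helper usable in scripts that run with "set -e".
    grep -i 'nvidia' || true
}
```

On a real host this would be used as `lspci | nvidia_devices`; empty output suggests the GPU is not visible on the PCI bus.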
NVIDIA Docker2
### Remove previously installed packages
# rpm -qa | grep nvidia
# yum info installed | grep nvidia
sudo yum remove -y nvidia-docker
sudo yum remove -y nvidia-docker2
## If the old version was installed from rpm files, remove those packages as well
rpm -qa | grep -E "libnvidia-container|nvidia-container-runtime" | xargs -r sudo rpm -e
### Set up the stable repository and the GPG key:
distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum clean expire-cache
### Build the yum cache
# sudo yum makecache
sudo yum install -y nvidia-docker2
### Restart the Docker daemon to complete the installation after setting the default runtime:
sudo systemctl restart docker
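The restart step mentions setting the default runtime, but the commands do not show how. A common way is to declare it in /etc/docker/daemon.json. The sketch below only prints such a file; `nvidia_daemon_json` is a hypothetical helper name, and the runtime path /usr/bin/nvidia-container-runtime is an assumption that should be checked against the installed package.

```shell
# Hypothetical helper: print a daemon.json that registers the nvidia
# runtime and makes it Docker's default runtime.
# Assumption: the runtime binary is /usr/bin/nvidia-container-runtime.
nvidia_daemon_json() {
    cat <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}
```

After reviewing the output, it could be written with `nvidia_daemon_json | sudo tee /etc/docker/daemon.json`, followed by the Docker restart. Note that this overwrites any existing daemon.json, so existing settings would need to be merged by hand.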
Verification
### At this point, a working setup can be tested by running a base CUDA container:
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
A successful installation prints output like the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:3B:00.0 Off |                  Off |
| N/A   37C    P0    33W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:86:00.0 Off |                  Off |
| N/A   37C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:D8:00.0 Off |                  Off |
| N/A   36C    P0    27W / 250W |      0MiB / 16280MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
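Reading the table by eye works for a single host; for several hosts the check can be scripted. The sketch below counts GPUs from the one-line-per-GPU listing that `nvidia-smi -L` prints. `count_gpus` is a hypothetical helper name, and the line format it matches ("GPU <index>: ...") is what `nvidia-smi -L` is assumed to emit.

```shell
# Hypothetical helper: count GPUs in `nvidia-smi -L` style output read
# from stdin, where each GPU is listed on a line starting "GPU <index>:".
count_gpus() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the helper "set -e" safe.
    grep -c '^GPU [0-9][0-9]*:' || true
}
```

Inside the test container this would be run as `sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi -L | count_gpus`; for the host shown above the expected count is 3.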
NVIDIA K8S Device plugin
This guide runs the plugin from its container image; for other deployment methods, see k8s-device-plugin.
Pull the image
docker pull nvidia/k8s-device-plugin:v0.7.3
docker tag nvidia/k8s-device-plugin:v0.7.3 nvidia/k8s-device-plugin:devel
Run the image
Choose one of the following two options:
Without compatibility for the CPUManager static policy:
docker run \
    -it \
    --security-opt=no-new-privileges \
    --cap-drop=ALL \
    --network=none \
    -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
    nvidia/k8s-device-plugin:devel
With compatibility for the CPUManager static policy:
docker run \
    -it \
    --privileged \
    --network=none \
    -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
    nvidia/k8s-device-plugin:devel --pass-device-specs
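Once the plugin is running, the node should advertise the `nvidia.com/gpu` resource, and a pod can request it. The following is a minimal sketch of a smoke-test pod, printed by a hypothetical helper `gpu_test_pod_manifest`; the pod name `gpu-smoke-test` and the container name `cuda` are made up for illustration, and the image is the one used in the verification step.

```shell
# Hypothetical helper: print a pod manifest that requests one GPU
# through the nvidia.com/gpu resource and runs nvidia-smi once.
gpu_test_pod_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

Assuming `kubectl` is configured for the cluster, the manifest could be applied with `gpu_test_pod_manifest | kubectl apply -f -` and the result read with `kubectl logs gpu-smoke-test`. If the pod stays Pending, `kubectl describe node` shows whether `nvidia.com/gpu` appears under the node's capacity.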
Appendix
Manual installation of nvidia-docker (download the packages on a machine with Internet access)
Untested; for reference only.
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install --downloadonly nvidia-docker2 --downloaddir=/tmp/nvidia
## Copy the downloaded packages to the server without network access, then run the following there
sudo rpm -ivh libnvidia-container1-1.1.1-1.x86_64.rpm libnvidia-container-tools-1.1.1-1.x86_64.rpm
sudo rpm -ivh nvidia-container-runtime-3.2.0-1.x86_64.rpm nvidia-container-toolkit-1.1.2-2.x86_64.rpm
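Because the offline server cannot fetch a forgotten package, it may help to check the copied directory before running rpm. The sketch below reports which of the four package names used above have no matching file; `missing_rpms` is a hypothetical helper, and it matches by name prefix only, so it does not verify versions or dependencies.

```shell
# Hypothetical helper: print each expected package name that has no
# "<name>-<version>...rpm" file in the given directory.
missing_rpms() {
    dir=$1
    for name in libnvidia-container1 libnvidia-container-tools \
                nvidia-container-runtime nvidia-container-toolkit; do
        found=0
        # The version is assumed to start with a digit, which keeps
        # "nvidia-container-runtime" from matching other name suffixes.
        for f in "$dir/$name"-[0-9]*.rpm; do
            if [ -e "$f" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "$name"
        fi
    done
}
```

For example, `missing_rpms /tmp/nvidia` prints nothing when all four packages are present, and otherwise lists the names still to be copied.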