一、k8s部署GPU共享插件 部署之前需要确保k8s节点上已安装nvidia-driver和nvidia-docker,同时已将docker默认运行时设置为nvidia
Copy 1
2
3
4
5
6
7
8
9
10
# cat /etc/docker/daemon.json
{
"runtimes" : {
"nvidia" : {
"path" : "/usr/bin/nvidia-container-runtime" ,
"runtimeArgs" : []
}
} ,
"default-runtime" : "nvidia" ,
}
1、helm安装gpushare-device-plugin Copy 1
2
3
$ git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
$ cd gpushare-scheduler-extender/deployer/chart
$ helm install --name gpushare --namespace kube-system --set masterCount = 3 gpushare-installer
2、打Label,标记GPU节点 Copy 1
2
$ kubectl label node sd-cluster-04 gpushare = true
$ kubectl label node sd-cluster-05 gpushare = true
3、安装kubectl-inspect-gpushare 需要提前安装kubectl,这里就省略
Copy 1
2
3
$ cd /usr/bin/
$ wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
$ chmod u+x /usr/bin/kubectl-inspect-gpushare
查看当前k8s集群GPU资源使用情况
Copy 1
2
3
4
5
6
7
$ kubectl inspect gpushare
NAME IPADDRESS GPU0( Allocated/Total) GPU Memory( GiB)
sd-cluster-04 192.168.1.214 0/14 0/14
sd-cluster-05 192.168.1.215 8/14 8/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
8/28 ( 28%)
当然要禁用GPU节点共享GPU资源,直接设置gpushare=false即可
Copy 1
2
$ kubectl label node sd-cluster-04 gpushare = false
$ kubectl label node sd-cluster-05 gpushare = false
二、验证测试 1、部署第一个应用 申请2G GPU显存,测试应用,应该分配到一张卡上
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
apiVersion : apps/v1
kind : Deployment
metadata :
name : binpack-1
labels :
app : binpack-1
spec :
replicas : 1
selector : # define how the deployment finds the pods it mangages
matchLabels :
app : binpack-1
template : # define the pods specifications
metadata :
labels :
app : binpack-1
spec :
containers :
- name : binpack-1
image : cheyang/gpu-player:v2
resources :
limits :
# GiB
aliyun.com/gpu-mem : 2
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
$ kubectl apply -f 1.yaml -n test
$ kubectl inspect gpushare
NAME IPADDRESS GPU0( Allocated/Total) GPU Memory( GiB)
sd-cluster-04 192.168.1.214 0/14 0/14
sd-cluster-05 192.168.1.215 2/14 2/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
2/28 ( 7%)
$ kubectl get pod -n test
NAME READY STATUS RESTARTS AGE
binpack-1-6d6955c487-j4c4b 1/1 Running 0 28m
$ kubectl logs -f binpack-1-6d6955c487-j4c4b -n test
ALIYUN_COM_GPU_MEM_DEV = 14
ALIYUN_COM_GPU_MEM_CONTAINER = 2
2021-08-13 02:47:22.395557: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 02:47:22.552831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate( GHz) : 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2021-08-13 02:47:22.552873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device ( /device:GPU:0) -> ( device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
2、部署第二个应用 现在申请8G内存,2个实例,总共16G内存
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
apiVersion : apps/v1
kind : Deployment
metadata :
name : binpack-2
labels :
app : binpack-2
spec :
replicas : 2
selector : # define how the deployment finds the pods it mangages
matchLabels :
app : binpack-2
template : # define the pods specifications
metadata :
labels :
app : binpack-2
spec :
containers :
- name : binpack-2
image : cheyang/gpu-player:v2
resources :
limits :
aliyun.com/gpu-mem : 8
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
$ kubectl apply -f 2.yaml -n test
$ kubectl inspect gpushare
NAME IPADDRESS GPU0( Allocated/Total) GPU Memory( GiB)
sd-cluster-04 192.168.1.214 8/14 8/14
sd-cluster-05 192.168.1.215 10/14 10/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
18/28 ( 64%)
$ kubectl get pod -n test
NAME READY STATUS RESTARTS AGE
binpack-1-6d6955c487-j4c4b 1/1 Running 0 28m
binpack-2-58579b95f7-4wpbl 1/1 Running 0 27m
$ kubectl logs -f binpack-2-58579b95f7-4wpbl -n test
ALIYUN_COM_GPU_MEM_DEV = 14
ALIYUN_COM_GPU_MEM_CONTAINER = 8
2021-08-13 02:48:41.246585: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 02:48:41.338992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate( GHz) : 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 13.07GiB
2021-08-13 02:48:41.339031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device ( /device:GPU:0) -> ( device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
通过资源占用可以看到,第二个应用,已经分别使用了2张卡的8G内存
3、部署第三个应用 申请2G显存
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
apiVersion : apps/v1
kind : Deployment
metadata :
name : binpack-3
labels :
app : binpack-3
spec :
replicas : 1
selector : # define how the deployment finds the pods it mangages
matchLabels :
app : binpack-3
template : # define the pods specifications
metadata :
labels :
app : binpack-3
spec :
containers :
- name : binpack-3
image : cheyang/gpu-player:v2
resources :
limits :
aliyun.com/gpu-mem : 2
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
$ kubectl apply -f 3.yaml -n test
$ kubectl inspect gpushare
NAME IPADDRESS GPU0( Allocated/Total) GPU Memory( GiB)
sd-cluster-04 192.168.1.214 8/14 8/14
sd-cluster-05 192.168.1.215 12/14 12/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
20/28 ( 71%)
$ kubectl get pod -n test
NAME READY STATUS RESTARTS AGE
binpack-1-6d6955c487-j4c4b 1/1 Running 0 28m
binpack-2-58579b95f7-4wpbl 1/1 Running 0 27m
binpack-2-58579b95f7-sjhwt 1/1 Running 0 27m
binpack-3-556bbd84f9-9xqg7 1/1 Running 0 14m
$ kubectl logs -f binpack-3-556bbd84f9-9xqg7 -n test
ALIYUN_COM_GPU_MEM_DEV = 14
ALIYUN_COM_GPU_MEM_CONTAINER = 2
2021-08-13 03:01:53.897423: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:01:54.008665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate( GHz) : 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 7.08GiB
2021-08-13 03:01:54.008716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device ( /device:GPU:0) -> ( device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
部署完第三个应用以后,目前可用最大显存为8G,但是实际我们部署应用的时候最多只能使用6G,因为同一个任务不能分布式的跑到不同的GPU卡上
4、部署第四个应用 申请5G显存,应该是调度到sd-cluster-04上
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
apiVersion : apps/v1
kind : Deployment
metadata :
name : binpack-4
labels :
app : binpack-4
spec :
replicas : 1
selector : # define how the deployment finds the pods it mangages
matchLabels :
app : binpack-4
template : # define the pods specifications
metadata :
labels :
app : binpack-4
spec :
containers :
- name : binpack-4
image : cheyang/gpu-player:v2
resources :
limits :
aliyun.com/gpu-mem : 5
Copy 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ kubectl apply -f 4.yaml -n test
$ kubectl inspect gpushare
NAME IPADDRESS GPU0( Allocated/Total) GPU Memory( GiB)
sd-cluster-04 192.168.1.214 13/14 13/14
sd-cluster-05 192.168.1.215 12/14 12/14
------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
25/28 ( 89%)
$ kubectl get pod -n test
NAME READY STATUS RESTARTS AGE
binpack-1-6d6955c487-j4c4b 1/1 Running 0 26m
binpack-2-58579b95f7-4wpbl 1/1 Running 0 24m
binpack-2-58579b95f7-sjhwt 1/1 Running 0 24m
binpack-3-556bbd84f9-9xqg7 1/1 Running 0 11m
binpack-4-6956458f85-cv62j 1/1 Running 0 6s
$ kubectl logs -f binpack-4-6956458f85-cv62j -n test
ALIYUN_COM_GPU_MEM_DEV = 14
ALIYUN_COM_GPU_MEM_CONTAINER = 5
2021-08-13 03:13:20.208122: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:13:20.361391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate( GHz) : 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 6.46GiB
2021-08-13 03:13:20.361481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device ( /device:GPU:0) -> ( device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
三、总结 gpushare-device-plugin使用中有如下需求不能满足
同一个任务没法同时使用多台机器的GPU卡(共享) 同一张卡下没法通过GPU负载百分比来分配资源 不过对于算法团队模型测试已经完全够用了,对于GPU共享方案,其实还有另外两种,这里我就不介绍了,有需要的直接参考官方仓库配置即可。
https://github.com/tkestack/gpu-manager
https://github.com/vmware/bitfusion-with-kubernetes-integration
参考资料:
https://github.com/AliyunContainerService/gpushare-scheduler-extender/tree/master/deployer