Alibaba Cloud Shared GPU Solution Test

Before deployment, make sure that nvidia-driver and nvidia-docker are installed on every k8s GPU node, and that the default runtime of Docker is set to nvidia

# cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
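
A syntax error in daemon.json (a trailing comma, for example) keeps dockerd from starting, so it can be worth checking the file before restarting Docker. The following is a minimal sketch using only the Python standard library; the function name `check_daemon_json` and the `SAMPLE` string are illustrative assumptions, not part of any Docker or NVIDIA tooling.

```python
import json


def check_daemon_json(text):
    """Return a list of problems found in a daemon.json document (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Strict JSON rejects trailing commas, comments, and single quotes.
        return [f"invalid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not "nvidia"')
    runtimes = config.get("runtimes")
    if not isinstance(runtimes, dict) or "nvidia" not in runtimes:
        problems.append('no "nvidia" entry under "runtimes"')
    return problems


# Illustrative config with the keys this setup relies on.
SAMPLE = """
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
"""

if __name__ == "__main__":
    # On a node, read the real file instead:
    #   text = open("/etc/docker/daemon.json").read()
    print(check_daemon_json(SAMPLE) or "daemon.json looks fine")
```

An empty list means the file parses and names nvidia as the default runtime; it does not prove that the runtime binary itself is installed.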

1. Install gpushare-device-plugin with Helm

$ git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
$ cd gpushare-scheduler-extender/deployer/chart
$ helm install --name gpushare --namespace kube-system --set masterCount=3 gpushare-installer

2. Label the GPU node

$ kubectl label node sd-cluster-04 gpushare=true
$ kubectl label node sd-cluster-05 gpushare=true

3. Install kubectl-inspect-gpushare

kubectl must already be installed; its installation is not covered here

$ cd /usr/bin/
$ wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
$ chmod u+x /usr/bin/kubectl-inspect-gpushare

Check the current k8s cluster GPU resource usage

$ kubectl inspect gpushare
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04  192.168.1.214  0/14                   0/14
sd-cluster-05  192.168.1.215  8/14                   8/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
8/28 (28%)
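
For scripting, this report can be turned into numbers. The sketch below assumes the column layout shown in the sample (node name first, the node's total `allocated/total` figure last); other plugin versions may print different columns. The function names `parse_inspect` and `cluster_summary` are hypothetical.

```python
SAMPLE = """\
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04  192.168.1.214  0/14                   0/14
sd-cluster-05  192.168.1.215  8/14                   8/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
8/28 (28%)
"""


def parse_inspect(output):
    """Map each node name to (allocated_gib, total_gib), read from the last column."""
    nodes = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        if line.startswith("-"):                  # separator: node rows are over
            break
        fields = line.split()
        allocated, total = (int(x) for x in fields[-1].split("/"))
        nodes[fields[0]] = (allocated, total)
    return nodes


def cluster_summary(nodes):
    """Return (allocated, total, percent) across all nodes."""
    allocated = sum(a for a, _ in nodes.values())
    total = sum(t for _, t in nodes.values())
    # Integer division reproduces the 28% in the sample (8/28 is about 28.57%).
    percent = allocated * 100 // total if total else 0
    return allocated, total, percent


if __name__ == "__main__":
    # Against a live cluster, the text could come from
    # subprocess.run(["kubectl", "inspect", "gpushare"], capture_output=True, text=True).stdout
    nodes = parse_inspect(SAMPLE)
    print(nodes)
    print(cluster_summary(nodes))
```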

To stop a GPU node from sharing its GPU resources, set its label to gpushare=false. The label already exists at this point, so kubectl needs --overwrite to change it

$ kubectl label node sd-cluster-04 gpushare=false --overwrite
$ kubectl label node sd-cluster-05 gpushare=false --overwrite

2. Verification test

1. Deploy the first application

Request 2 GiB of GPU memory for the test application. It should be placed on a single card

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-1
  labels:
    app: binpack-1

spec:
  replicas: 1

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1

    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 2

$ kubectl apply -f 1.yaml -n test
$ kubectl inspect gpushare
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04  192.168.1.214  0/14                   0/14
sd-cluster-05  192.168.1.215  2/14                   2/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
2/28 (7%)
$ kubectl get pod -n test
NAME                               READY   STATUS    RESTARTS   AGE
binpack-1-6d6955c487-j4c4b         1/1     Running   0          28m
$ kubectl logs -f binpack-1-6d6955c487-j4c4b -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-08-13 02:47:22.395557: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 02:47:22.552831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2021-08-13 02:47:22.552873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)

2. Deploy the second application.

Now request 8 GiB of GPU memory per instance with 2 instances, 16 GiB in total

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-2
  labels:
    app: binpack-2

spec:
  replicas: 2

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-2

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-2

    spec:
      containers:
      - name: binpack-2
        image: cheyang/gpu-player:v2
        resources:
          limits:
            aliyun.com/gpu-mem: 8

$ kubectl apply -f 2.yaml -n test
$ kubectl inspect gpushare
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04  192.168.1.214  8/14                   8/14
sd-cluster-05  192.168.1.215  10/14                  10/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
18/28 (64%)
$ kubectl get pod -n test
NAME                               READY   STATUS    RESTARTS   AGE
binpack-1-6d6955c487-j4c4b         1/1     Running   0          28m
binpack-2-58579b95f7-4wpbl         1/1     Running   0          27m
$ kubectl logs -f binpack-2-58579b95f7-4wpbl -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=8
2021-08-13 02:48:41.246585: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 02:48:41.338992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 13.07GiB
2021-08-13 02:48:41.339031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)

The resource usage shows that the two instances of the second application each took 8 GiB, one on each of the two cards
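
The manifests in this test differ only in name, replica count, and requested GPU memory, so they can also be generated. The sketch below builds the same Deployment structure as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The function name `gpushare_deployment` and the script name `gen.py` are hypothetical; only `aliyun.com/gpu-mem` and the image come from the manifests in this post.

```python
import json


def gpushare_deployment(name, gpu_mem_gib, replicas=1, image="cheyang/gpu-player:v2"):
    """Build a Deployment manifest whose pods each request gpu_mem_gib GiB of shared GPU memory."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {"limits": {"aliyun.com/gpu-mem": gpu_mem_gib}},
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Example (gen.py is a hypothetical file name for this script):
    #   python3 gen.py | kubectl apply -n test -f -
    print(json.dumps(gpushare_deployment("binpack-2", 8, replicas=2), indent=2))
```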

3. Deploy the third application

Request 2 GiB of GPU memory

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-3
  labels:
    app: binpack-3

spec:
  replicas: 1

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-3

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-3

    spec:
      containers:
      - name: binpack-3
        image: cheyang/gpu-player:v2
        resources:
          limits:
            aliyun.com/gpu-mem: 2

$ kubectl apply -f 3.yaml -n test
$ kubectl inspect gpushare
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04  192.168.1.214  8/14                   8/14
sd-cluster-05  192.168.1.215  12/14                  12/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
20/28 (71%)
$ kubectl get pod -n test
NAME                               READY   STATUS    RESTARTS   AGE
binpack-1-6d6955c487-j4c4b         1/1     Running   0          28m
binpack-2-58579b95f7-4wpbl         1/1     Running   0          27m
binpack-2-58579b95f7-sjhwt         1/1     Running   0          27m
binpack-3-556bbd84f9-9xqg7         1/1     Running   0          14m
$ kubectl logs -f binpack-3-556bbd84f9-9xqg7 -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-08-13 03:01:53.897423: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:01:54.008665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 7.08GiB
2021-08-13 03:01:54.008716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)

After the third application is deployed, 8 GiB of GPU memory is still free in the cluster (6 GiB on sd-cluster-04 and 2 GiB on sd-cluster-05), but a single application can request at most 6 GiB, because one task cannot be split across different GPU cards
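
The gap between free memory in the cluster and the largest request that can still be placed is simple arithmetic. A minimal sketch, using the allocation figures from the `kubectl inspect gpushare` output above and assuming one GPU card per node as in this cluster; the function name `free_memory` is hypothetical.

```python
def free_memory(cards):
    """cards maps a card to (allocated_gib, total_gib).

    Returns (total free GiB in the cluster, largest request one card can still hold).
    """
    free = {name: total - allocated for name, (allocated, total) in cards.items()}
    return sum(free.values()), max(free.values())


# State after the third deployment: 8/14 on sd-cluster-04, 12/14 on sd-cluster-05.
cards = {"sd-cluster-04": (8, 14), "sd-cluster-05": (12, 14)}
total_free, largest_request = free_memory(cards)
print(total_free, largest_request)  # 8 6
```

The sum counts every free GiB, while the maximum only counts what sits together on one card, which is the limit that matters for a single pod.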

4. Deploy the fourth application

Request 5 GiB of GPU memory. It should be scheduled to sd-cluster-04, the only card with enough free memory

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-4
  labels:
    app: binpack-4

spec:
  replicas: 1

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4

    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            aliyun.com/gpu-mem: 5

$ kubectl apply -f 4.yaml -n test
$ kubectl inspect gpushare
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04  192.168.1.214  13/14                  13/14
sd-cluster-05  192.168.1.215  12/14                  12/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
25/28 (89%)
$ kubectl get pod -n test
NAME                               READY   STATUS    RESTARTS   AGE
binpack-1-6d6955c487-j4c4b         1/1     Running   0          26m
binpack-2-58579b95f7-4wpbl         1/1     Running   0          24m
binpack-2-58579b95f7-sjhwt         1/1     Running   0          24m
binpack-3-556bbd84f9-9xqg7         1/1     Running   0          11m
binpack-4-6956458f85-cv62j         1/1     Running   0          6s
$ kubectl logs -f binpack-4-6956458f85-cv62j -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2021-08-13 03:13:20.208122: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:13:20.361391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 6.46GiB
2021-08-13 03:13:20.361481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
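
The placements observed from the second deployment onward are consistent with a best-fit rule: each pod goes to the card with the least free memory that can still hold it. The sketch below replays the requests under that rule. It is a simplified model written for this post, not the scheduler extender's code, and in a real cluster node choice also depends on the default scheduler; the names `best_fit` and `replay` are hypothetical.

```python
def best_fit(free, request):
    """Return the card with the least free GiB that still fits the request, or None."""
    candidates = [(gib, name) for name, gib in free.items() if gib >= request]
    if not candidates:
        return None
    return min(candidates)[1]


def replay(free, requests):
    """Place each (app, gib) request in order, updating free in place."""
    placements = []
    for app, gib in requests:
        card = best_fit(free, gib)
        if card is not None:
            free[card] -= gib
        placements.append((app, card))
    return placements


# Free GiB per card once binpack-1 (2 GiB) is running on sd-cluster-05.
free = {"sd-cluster-04": 14, "sd-cluster-05": 12}
requests = [("binpack-2", 8), ("binpack-2", 8), ("binpack-3", 2), ("binpack-4", 5)]

placements = replay(free, requests)
for app, card in placements:
    print(app, "->", card)
print(free)  # remaining free GiB per card
```

Under this model sd-cluster-04 ends with 1 GiB free and sd-cluster-05 with 2 GiB free, which corresponds to the 13/14 and 12/14 reported after the fourth deployment.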

3. Summary

gpushare-device-plugin cannot meet the following requirements

A single task cannot use GPU cards on multiple machines at the same time
Resources on a single card cannot be allocated by GPU load percentage

However, it is fully sufficient for the algorithm team's model testing. There are two other GPU sharing solutions, which are not covered here. If you need them, refer to the configuration in their official repositories.

https://github.com/tkestack/gpu-manager
https://github.com/vmware/bitfusion-with-kubernetes-integration

References: https://github.com/AliyunContainerService/gpushare-scheduler-extender/tree/master/deployer
