Before deploying, make sure the NVIDIA driver and nvidia-docker are installed on the k8s nodes, and that Docker's default runtime is set to nvidia.
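If the default runtime still needs to be switched, a minimal sketch of a typical setup looks like the following (the file path and systemd restart command are assumptions for a standard install; adjust for your environment):

```shell
# Assumed typical setup: set nvidia as the default runtime in /etc/docker/daemon.json,
# then restart Docker and verify.
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
$ systemctl restart docker
$ docker info | grep -i "default runtime"
 Default Runtime: nvidia
```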
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-2
  labels:
    app: binpack-2
spec:
  replicas: 2
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-2
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-2
    spec:
      containers:
      - name: binpack-2
        image: cheyang/gpu-player:v2
        resources:
          limits:
            aliyun.com/gpu-mem: 8
```
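Assuming the manifest above is saved as 2.yaml (the file name is an assumption, following the naming pattern of the other steps), it is deployed the same way as the others:

```shell
# Deploy binpack-2 and check that both replicas are running.
$ kubectl apply -f 2.yaml -n test
$ kubectl get pod -n test -l app=binpack-2
```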
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-3
  labels:
    app: binpack-3
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-3
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-3
    spec:
      containers:
      - name: binpack-3
        image: cheyang/gpu-player:v2
        resources:
          limits:
            aliyun.com/gpu-mem: 2
```

```shell
$ kubectl apply -f 3.yaml -n test
$ kubectl inspect gpushare
NAME            IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04   192.168.1.214  8/14                   8/14
sd-cluster-05   192.168.1.215  12/14                  12/14
----------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
20/28 (71%)
$ kubectl get pod -n test
NAME                         READY   STATUS    RESTARTS   AGE
binpack-1-6d6955c487-j4c4b   1/1     Running   0          28m
binpack-2-58579b95f7-4wpbl   1/1     Running   0          27m
binpack-2-58579b95f7-sjhwt   1/1     Running   0          27m
binpack-3-556bbd84f9-9xqg7   1/1     Running   0          14m
$ kubectl logs -f binpack-3-556bbd84f9-9xqg7 -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-08-13 03:01:53.897423: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:01:54.008665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 7.08GiB
2021-08-13 03:01:54.008716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
```
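The two ALIYUN_COM_GPU_MEM_* variables in the log are how the plugin communicates the per-container memory quota; the device plugin does not hard-isolate GPU memory, so the workload itself is expected to stay within that limit. As a quick check (the pod name is taken from the output above), the variables can be read directly from the running pod:

```shell
# Pod name taken from the kubectl get pod output above.
$ kubectl exec binpack-3-556bbd84f9-9xqg7 -n test -- env | grep ALIYUN
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
```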
After deploying the third application, the cluster still has 8 GiB of GPU memory unallocated, but a single new application can request at most 6 GiB, because one task cannot be spread across different GPU cards: sd-cluster-04 has 6 GiB free and sd-cluster-05 has 2 GiB free.
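For example, a hypothetical deployment requesting 7 GiB (the name below is illustrative, not part of the original setup) would stay Pending, since neither card has 7 GiB free even though the cluster as a whole still has 8 GiB left:

```yaml
# Illustrative only: no single GPU has 7 GiB free, so this pod cannot be scheduled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-toobig
spec:
  replicas: 1
  selector:
    matchLabels:
      app: binpack-toobig
  template:
    metadata:
      labels:
        app: binpack-toobig
    spec:
      containers:
      - name: binpack-toobig
        image: cheyang/gpu-player:v2
        resources:
          limits:
            aliyun.com/gpu-mem: 7
```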
4. Deploy the fourth application
Request 5 GiB of GPU memory; the pod should be scheduled to sd-cluster-04, which has 6 GiB free.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
  labels:
    app: binpack-4
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            aliyun.com/gpu-mem: 5
```
```shell
$ kubectl apply -f 4.yaml -n test
$ kubectl inspect gpushare
NAME            IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
sd-cluster-04   192.168.1.214  13/14                  13/14
sd-cluster-05   192.168.1.215  12/14                  12/14
---------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
25/28 (89%)
$ kubectl get pod -n test
NAME                         READY   STATUS    RESTARTS   AGE
binpack-1-6d6955c487-j4c4b   1/1     Running   0          26m
binpack-2-58579b95f7-4wpbl   1/1     Running   0          24m
binpack-2-58579b95f7-sjhwt   1/1     Running   0          24m
binpack-3-556bbd84f9-9xqg7   1/1     Running   0          11m
binpack-4-6956458f85-cv62j   1/1     Running   0          6s
$ kubectl logs -f binpack-4-6956458f85-cv62j -n test
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2021-08-13 03:13:20.208122: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-08-13 03:13:20.361391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:af:00.0
totalMemory: 14.75GiB freeMemory: 6.46GiB
2021-08-13 03:13:20.361481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
```
3. Summary
The following requirements cannot be met when using gpushare-device-plugin:

- A single task cannot use (share) GPU cards on multiple machines at the same time.
- Within the same card, resources cannot be allocated as a percentage of GPU load; allocation is by memory size only.
However, it is completely sufficient for the algorithm team's model testing. There are two other GPU sharing solutions, which I will not cover here; if you need them, refer to the configuration in their official repositories.