Recently, a new AI project was launched that mainly provides online AI experiments for universities. The project purchased a GPU server, but it has only a single NVIDIA Tesla T4 card, which needs to support multiple students running experiments at the same time.
The online experiment system runs on Kubernetes, so GPU sharing has to work in a k8s environment. I had tested Alibaba Cloud's GPU sharing solution (the gpushare scheduler extender) before, so I will simply record the usage steps here:
kubernetes cluster version: 1.23.1
Adjust K8S scheduler
Since 1.23, kube-scheduler's configuration mechanism has changed (the legacy scheduling policy API was removed), so the previous way of deploying the scheduler extender no longer works.
Since I installed k8s through kubeasz, the kubeconfig is located at /etc/kubernetes/kubelet.kubeconfig, and the scheduler extender has to be declared in /etc/kubernetes/scheduler-policy-config.yaml.
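On 1.23+ that file takes the form of a KubeSchedulerConfiguration object rather than the old policy format. A minimal sketch of what it could contain, based on the gpushare-scheduler-extender documentation (the extender address 127.0.0.1:32766 is an assumption that the extender runs on the scheduler's node; adjust it and the kubeconfig path to your deployment):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  # kubeconfig used by kube-scheduler; path assumed from the kubeasz layout above
  kubeconfig: /etc/kubernetes/kubelet.kubeconfig
extenders:
# forward filter/bind decisions for aliyun.com/gpu-mem to the gpushare extender
- urlPrefix: "http://127.0.0.1:32766/gpushare-scheduler"
  filterVerb: filter
  bindVerb: bind
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem
    ignoredByScheduler: false
```

After updating the file, restart kube-scheduler so it picks up the new configuration.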
Then deploy a test workload that requests a slice of GPU memory via the aliyun.com/gpu-mem resource:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-count: 1
            aliyun.com/gpu-mem: 5
```
```shell
# kubectl apply -f samples/1.yaml
deployment.apps/binpack-1 created
# kubectl get pod | grep binpack
binpack-1-9995bdf69-pk2d4   1/1   Running   0   12s
# kubectl logs -f binpack-1-9995bdf69-pk2d4
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2023-06-30 09:40:50.890296: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-06-30 09:40:50.976283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:03:00.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2023-06-30 09:40:50.976313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:03:00.0, compute capability: 7.5)
```
From the log, we can see that GPU sharing works: the container was allocated 5 GiB of GPU memory (ALIYUN_COM_GPU_MEM_CONTAINER=5) out of the card's 14 GiB (ALIYUN_COM_GPU_MEM_DEV=14).
Of course, you can also use the kubectl plugin kubectl-inspect-gpushare to check the per-node GPU memory allocation:
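If the plugin is not installed yet, it can be fetched as a prebuilt binary from the gpushare-device-plugin releases page. A sketch (the release tag v0.3.0 and install path are assumptions; check the repository for the current release):

```shell
# Download the kubectl plugin binary and place it on the PATH
# (v0.3.0 is an assumed release tag; pick the current one)
cd /usr/local/bin
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod +x kubectl-inspect-gpushare

# Query per-node / per-GPU memory allocation
kubectl inspect gpushare
```

The output lists each node's GPUs with allocated vs. total GPU memory, so you can see how the T4's memory is divided among the student pods.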