We recently launched a new AI project that provides online AI experiments for universities. The project purchased a GPU server, but it has only one NVIDIA Tesla T4 card, which needs to support multiple students running experiments at the same time.

The online experiment system runs on Kubernetes, so GPU sharing has to work in the k8s environment. I had previously tested Alibaba Cloud's GPU sharing solution (gpushare-scheduler-extender), so here I'll record the setup steps directly:

Kubernetes cluster version: 1.23.1

## Adjust the k8s scheduler

Since Kubernetes 1.23, the kube-scheduler no longer accepts the old scheduler policy file; configuration has moved to the KubeSchedulerConfiguration API, so the previous way of deploying the extender no longer works.

See the scheduling policy reference: https://kubernetes.io/docs/reference/scheduling/policies/

```shell
git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
cp gpushare-scheduler-extender/config/scheduler-policy-config.yaml /etc/kubernetes/
```

Since I installed k8s through kubeasz, the kubeconfig files are placed under /etc/kubernetes/ (e.g. /etc/kubernetes/kubelet.kubeconfig), so /etc/kubernetes/scheduler-policy-config.yaml needs to be updated as follows:

```yaml
---
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/kube-scheduler.kubeconfig
extenders:
- urlPrefix: "http://192.168.233.101:32766/gpushare-scheduler"
  filterVerb: filter
  bindVerb: bind
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem
    ignoredByScheduler: false
  ignorable: false
```

Then modify the kube-scheduler startup parameters in /etc/systemd/system/kube-scheduler.service:

```ini
[Unit]
Description=Kubernetes Scheduler
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
ExecStart=/opt/kube/bin/kube-scheduler \
  --authentication-kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
  --authorization-kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
  --bind-address=0.0.0.0 \
  --kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
  --config=/etc/kubernetes/scheduler-policy-config.yaml \
  --leader-elect=true \
  --v=2
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Reload systemd and restart the scheduler:

```shell
systemctl daemon-reload
systemctl restart kube-scheduler.service
```

## Install the scheduler extender

Label the GPU machine:

```shell
kubectl label node mynode gpushare=true
```

Then deploy the extender:

```shell
cd gpushare-scheduler-extender/config
kubectl apply -f gpushare-schd-extender.yaml
```
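Before moving on, it's worth confirming that kube-scheduler came back up with the new config, and that the extender service is exposed on the NodePort referenced by urlPrefix. A minimal sanity check (assuming systemd on the master node; the 32766 port is from my config above):

```shell
# Confirm the scheduler restarted cleanly with the new config
systemctl is-active kube-scheduler.service

# The gpushare extender service should expose the NodePort used in urlPrefix (32766 here)
kubectl get svc -n kube-system | grep gpushare
```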

## Install the device plugin

```shell
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
```

After the installation completes, both gpushare pods should be running:

```shell
# kubectl get pod -n kube-system |grep gpushare
gpushare-device-plugin-ds-j8blj            1/1   Running   0     30h
gpushare-schd-extender-74796c5f64-7g4bl    1/1   Running   0     29h
```
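You can also confirm that the device plugin registered the shared resource on the GPU node. A quick check (the node name is from my cluster; substitute your own):

```shell
# Both Capacity and Allocatable should now list aliyun.com/gpu-mem
kubectl describe node 192.168.233.101 | grep aliyun.com/gpu-mem
```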

## Test GPU sharing

Modify samples/1.yaml as follows:

```yaml
apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-1
  labels:
    app: binpack-1

spec:
  replicas: 1

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1

  template: # define the pod's specification
    metadata:
      labels:
        app: binpack-1

    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-count: 1
            aliyun.com/gpu-mem: 5
```
Apply the manifest and check the pod logs:

```shell
# kubectl apply -f samples/1.yaml
deployment.apps/binpack-1 created
# kubectl get pod |grep binpack
binpack-1-9995bdf69-pk2d4                 1/1     Running   0             12s
# kubectl logs -f binpack-1-9995bdf69-pk2d4
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2023-06-30 09:40:50.890296: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-06-30 09:40:50.976283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:03:00.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2023-06-30 09:40:50.976313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:03:00.0, compute capability: 7.5)
```

The log shows that GPU sharing is working: the container was allocated 5 GiB of the card's 14 GiB of GPU memory (ALIYUN_COM_GPU_MEM_CONTAINER=5 out of ALIYUN_COM_GPU_MEM_DEV=14). Note that gpushare provides scheduling-level sharing only and does not enforce memory isolation, so each workload is expected to stay within the limit it was allocated.
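If you want to see how the limit reaches the workload: the device plugin passes the memory budget into the container as environment variables, which is what the sample program prints at startup. A quick check against the running pod (pod name from the output above; this assumes env is available in the image):

```shell
# Inspect the env vars injected by the gpushare device plugin
kubectl exec binpack-1-9995bdf69-pk2d4 -- env | grep ALIYUN_COM_GPU_MEM
```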

You can also check the allocation with the kubectl plugin kubectl-inspect-gpushare:

```shell
# cd /usr/bin/
# wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
# chmod u+x /usr/bin/kubectl-inspect-gpushare
# kubectl-inspect-gpushare
NAME             IPADDRESS        GPU0(Allocated/Total)  GPU Memory(GiB)
192.168.233.101  192.168.233.101  9/14                   9/14
--------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
9/14 (64%)
```

At this point, GPU sharing is set up; all that remains is for the students' TensorFlow workloads to request their share of the card.
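For example, a student's TensorFlow pod just requests aliyun.com/gpu-mem like any other resource. A minimal sketch (the pod name, image, and 2 GiB figure are illustrative assumptions, not from this project):

```shell
# Hypothetical student workload requesting a 2 GiB slice of the shared T4
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: student-tf-demo            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: tf
    image: tensorflow/tensorflow:latest-gpu   # assumed CUDA-enabled image
    command: ["python", "-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]
    resources:
      limits:
        aliyun.com/gpu-mem: 2      # GiB on the shared card
EOF
```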

Reference document:

https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md