最近新上了一个AI项目,主要是给大学提供AI在线实验,项目也采购了GPU服务器,但是只有一张Nvidia Tesla T4卡,需要支持多个学生同时在线做实验呢。
在线实验系统目前运行在Kubernetes上,因此,需要考虑k8s环境下,GPU共享,之前也测试过阿里云GPU卡共享方案,这里就直接记录使用步骤即可:
kubernetes集群版本:1.23.1
调整K8S调度器
从1.23.+起,由于kube-scheduler调度策略有调整,所以之前的部署方式就不好时了。
参考这里:https://kubernetes.io/docs/reference/scheduling/policies/
1
2
|
git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
cp config/scheduler-policy-config.yaml /etc/kubernetes/
|
由于我这里是通过kubeasz安装的k8s,所以Kubeconfig放在这里/etc/kubernetes/kubelet.kubeconfig,需要更新 /etc/kubernetes/scheduler-policy-config.yaml如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
---
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
kubeconfig: /etc/kubernetes/kube-scheduler.kubeconfig
extenders:
- urlPrefix: "http://192.168.233.101:32766/gpushare-scheduler"
filterVerb: filter
bindVerb: bind
enableHTTPS: false
nodeCacheCapable: true
managedResources:
- name: aliyun.com/gpu-mem
ignoredByScheduler: false
ignorable: false
|
修改Kube-scheduler启动启动参数,/etc/systemd/system/kube-scheduler.service
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
[Unit]
Description=Kubernetes Scheduler
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
[Service]
ExecStart=/opt/kube/bin/kube-scheduler \
--authentication-kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
--authorization-kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
--bind-address=0.0.0.0 \
--kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig \
--config=/etc/kubernetes/scheduler-policy-config.yaml \
--leader-elect=true \
--v=2
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
|
1
2
|
systemctl daemon-reload
systemctl restart kube-scheduler.service
|
安装调度器扩展器
给GPU机器打上label:
1
|
kubectl label node mynode gpushare=true
|
1
2
|
cd gpushare-scheduler-extender/config
kubectl apply -f gpushare-schd-extender.yaml
|
安装插件
1
2
|
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
|
安装完成以后如下
1
2
3
4
|
# kubectl get pod -n kube-system |grep gpushare
gpushare-device-plugin-ds-j8blj 1/1 Running 0 30h
gpushare-schd-extender-74796c5f64-7g4bl 1/1 Running 0 29h
|
测试GPU
修改samples/1.yaml如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
|
apiVersion: apps/v1
kind: Deployment
metadata:
name: binpack-1
labels:
app: binpack-1
spec:
replicas: 1
selector: # define how the deployment finds the pods it mangages
matchLabels:
app: binpack-1
template: # define the pods specifications
metadata:
labels:
app: binpack-1
spec:
containers:
- name: binpack-1
image: cheyang/gpu-player:v2
resources:
limits:
# GiB
aliyun.com/gpu-count: 1
aliyun.com/gpu-mem: 5
|
1
2
3
4
5
6
7
8
9
10
11
12
13
|
# kubectl apply -f samples/1.yaml
deployment.apps/binpack-1 created
# kubectl get pod |grep binpack
binpack-1-9995bdf69-pk2d4 1/1 Running 0 12s
# kubectl logs -f binpack-1-9995bdf69-pk2d4
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=5
2023-06-30 09:40:50.890296: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-06-30 09:40:50.976283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:03:00.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2023-06-30 09:40:50.976313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:03:00.0, compute capability: 7.5)
|
通过日志看,GPU共享已经成功,且申请了5G显存。
当然你也可以通过kubectl扩展工具kubectl-inspect-gpushare来查看:
1
2
3
4
5
6
7
8
9
|
# cd /usr/bin/
# wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
# chmod u+x /usr/bin/kubectl-inspect-gpushare
# kubectl-inspect-gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU Memory(GiB)
192.168.233.101 192.168.233.101 9/14 9/14
--------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
9/14 (64%)
|
至此,GPU卡共享已经完成,剩下就是通过tensorflow来调度我的GPU卡了。
参考文档:
https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md