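1. Prerequisites: a Kubernetes environment is required. This walkthrough uses Alibaba Cloud serverless Kubernetes (the CoreDNS component must be installed).
2. Dask Kubernetes environment setup: deploy the Dask Kubernetes operator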
$ helm repo add dask https://helm.dask.org
$ helm repo update
$ helm repo list
NAME    URL
dask    https://helm.dask.org
$ helm search repo dask
$ helm pull dask/dask-kubernetes-operator
$ tar xvf dask-kubernetes-operator-2023.8.1.tgz
$ cd dask-kubernetes-operator
$ vim values.yaml
image:
  # name: ghcr.io/dask/dask-kubernetes-operator  # Docker image for the operator
  # Pull the image from the Nanjing University ghcr.io mirror instead, to avoid pull timeouts
  name: ghcr.nju.edu.cn/dask/dask-kubernetes-operator  # Docker image for the operator
  tag: "2023.8.1"  # Release version
  pullPolicy: IfNotPresent  # Pull policy
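The registry swap made in values.yaml follows a simple pattern: only the registry host changes, while the repository path and tag stay the same. A minimal Python sketch of that rewrite, assuming the mirror serves the same paths as ghcr.io. The helper name `to_mirror` is hypothetical and not part of any Dask or Helm tooling:

```python
GHCR = "ghcr.io"
NJU_MIRROR = "ghcr.nju.edu.cn"  # mirror host used in this walkthrough


def to_mirror(image: str, mirror: str = NJU_MIRROR) -> str:
    """Rewrite a ghcr.io image reference to pull from a mirror host.

    Hypothetical helper: only the registry host is replaced; references
    to other registries are returned unchanged.
    """
    host, sep, rest = image.partition("/")
    if sep and host == GHCR:
        return f"{mirror}/{rest}"
    return image


print(to_mirror("ghcr.io/dask/dask-kubernetes-operator:2023.8.1"))
```

The same function could be applied to the `ghcr.io/dask/dask:latest` references used later in the DaskJob manifest.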
// Deploy the Dask Kubernetes operator
$ helm install dask-kubernetes-operator-2023.8.1 ./dask-kubernetes-operator --values ./dask-kubernetes-operator/values.yaml
$ kubectl get pod
NAME                                                 READY   STATUS    RESTARTS   AGE
dask-kubernetes-operator-2023.8.1-68cd86f7cc-n2fsq   1/1     Running   0          1h
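3. Using DaskJob
a. The annotations here request Alibaba Cloud serverless Kubernetes ECI spot instances for the pods, which can reduce cost.
b. The image URLs point to the Nanjing University mirror to avoid pull timeouts.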
$ cat dask-job.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskJob
metadata:
  name: simple-job
  namespace: default
spec:
  job:
    spec:
      containers:
        - name: job
          # image: "m.daocloud.io/ghcr.io/dask/dask:latest"
          # Business code that uses Python Dask for distributed computing should really be packaged into its own application image.
          # For convenience, this test uses the official image and example code directly.
          image: "ghcr.nju.edu.cn/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - python
            - -c
            - "from dask.distributed import Client; client = Client(); print(client) # Do some work..."
  cluster:
    spec:
      worker:
        replicas: 2
        metadata:
          annotations:
            k8s.aliyun.com/eci-spot-strategy: SpotAsPriceGo
            k8s.aliyun.com/eci-use-specs: 4-8Gi
        spec:
          containers:
            - name: worker
              # image: "m.daocloud.io/ghcr.io/dask/dask:latest"
              image: "ghcr.nju.edu.cn/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-worker
                - --name
                - $(DASK_WORKER_NAME)
                - --dashboard
                - --dashboard-address
                - "8788"
              ports:
                - name: http-dashboard
                  containerPort: 8788
                  protocol: TCP
              env:
                - name: WORKER_ENV
                  value: hello-world # We don't test the value, just the name
      scheduler:
        metadata:
          annotations:
            k8s.aliyun.com/eci-spot-strategy: SpotAsPriceGo
            k8s.aliyun.com/eci-use-specs: 2-4Gi
        spec:
          containers:
            - name: scheduler
              # image: "m.daocloud.io/ghcr.io/dask/dask:latest"
              image: "ghcr.nju.edu.cn/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-scheduler
              ports:
                - name: tcp-comm
                  containerPort: 8786
                  protocol: TCP
                - name: http-dashboard
                  containerPort: 8787
                  protocol: TCP
              readinessProbe:
                httpGet:
                  port: http-dashboard
                  path: /health
                initialDelaySeconds: 5
                periodSeconds: 10
              livenessProbe:
                httpGet:
                  port: http-dashboard
                  path: /health
                initialDelaySeconds: 15
                periodSeconds: 20
              env:
                - name: SCHEDULER_ENV
                  value: hello-world
        service:
          type: ClusterIP
          #type: LoadBalancer
          selector:
            dask.org/cluster-name: simple-job
            dask.org/component: scheduler
          ports:
            - name: tcp-comm
              protocol: TCP
              port: 8786
              targetPort: "tcp-comm"
            - name: http-dashboard
              protocol: TCP
              port: 8787
              targetPort: "http-dashboard"
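The worker and scheduler sections in the manifest carry the same two ECI annotations and differ only in the instance spec string. When manifests are generated rather than written by hand, that repetition can be factored out. A rough Python sketch, assuming the two annotation keys shown in the manifest; the helper names `eci_spot_annotations` and `with_spot` are hypothetical and not part of the operator or any Alibaba Cloud SDK:

```python
import json


def eci_spot_annotations(specs: str) -> dict:
    """Annotations used in this walkthrough to request an ECI spot instance."""
    return {
        "k8s.aliyun.com/eci-spot-strategy": "SpotAsPriceGo",
        "k8s.aliyun.com/eci-use-specs": specs,
    }


def with_spot(component: dict, specs: str) -> dict:
    """Return a copy of a worker/scheduler section with the annotations merged in.

    The input dict is left unmodified; existing annotations are preserved.
    """
    merged = dict(component)
    metadata = dict(merged.get("metadata", {}))
    annotations = dict(metadata.get("annotations", {}))
    annotations.update(eci_spot_annotations(specs))
    metadata["annotations"] = annotations
    merged["metadata"] = metadata
    return merged


# Trimmed-down worker section, for illustration only.
worker = {"replicas": 2, "spec": {"containers": [{"name": "worker"}]}}
print(json.dumps(with_spot(worker, "4-8Gi"), indent=2))
```

The printed fragment is only the worker section, not a complete DaskJob; it would still need to be placed under `spec.cluster.spec.worker` of a full manifest.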
$ kubectl apply -f dask-job.yaml
$ kubectl get pod
NAME                                                    READY   STATUS    RESTARTS   AGE
dask-kubernetes-operator-2023.8.1-68cd86f7cc-n2fsq      1/1     Running   0          20h
simple-job-default-worker-6cc619da50-66b7647d89-k9kqd   1/1     Running   0          28s
simple-job-default-worker-e5bf0bbc1e-6ff6c877b4-d4ksj   1/1     Running   0          28s
simple-job-runner                                       1/1     Running   0          29s
simple-job-scheduler-5db7df9769-v8926                   1/1     Running   0          28s
$ kubectl get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
kubernetes             ClusterIP   192.168.0.1      <none>        443/TCP             22h
simple-job-scheduler   ClusterIP   192.168.116.60   <none>        8786/TCP,8787/TCP   48s
// Check the run result
$ kubectl logs --tail=200 -f simple-job-runner
+ '[' '' ']'
+ '[' '' == true ']'
+ CONDA_BIN=/opt/conda/bin/conda
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
+ '[' '' ']'
no environment.yml
+ '[' '' ']'
+ exec python -c 'from dask.distributed import Client; client = Client(); print(client)# Do some work...'
<Client: 'tcp://172.26.219.61:8786' processes=0 threads=0, memory=0 B>
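The client summary in this log reports processes=0 and threads=0, which suggests the workers had not yet registered with the scheduler when the runner printed it. A runner script that does real work would normally wait for workers first. The following is a sketch only, assuming that `dask.distributed` is installed in the image and that the scheduler address reaches the runner pod through the `DASK_SCHEDULER_ADDRESS` environment variable (which would explain why `Client()` in the manifest connects without arguments). The names `square` and `main` are illustrative:

```python
import os


def square(x: int) -> int:
    """Toy task standing in for real business logic."""
    return x * x


def main(n_workers: int = 2) -> int:
    # Imported here so the module can be loaded without dask installed.
    from dask.distributed import Client

    # With no arguments, Client() uses the scheduler address from the
    # DASK_SCHEDULER_ADDRESS environment variable when it is set.
    client = Client()
    # Block until the expected number of workers has registered.
    client.wait_for_workers(n_workers=n_workers, timeout=120)
    futures = client.map(square, range(10))
    total = sum(client.gather(futures))
    client.close()
    return total


if __name__ == "__main__":
    # Only talk to a cluster when a scheduler address is present
    # (assumed to be the case inside the runner pod).
    if os.environ.get("DASK_SCHEDULER_ADDRESS"):
        print(main())
    else:
        print("DASK_SCHEDULER_ADDRESS not set; skipping cluster run")
```

As the comment in the manifest notes, a script like this would typically be packaged into a dedicated application image rather than passed inline with `python -c`.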
// After the job finishes, the other pods are cleaned up automatically
$ kubectl get pod
NAME                                                 READY   STATUS      RESTARTS   AGE
dask-kubernetes-operator-2023.8.1-68cd86f7cc-n2fsq   1/1     Running     0          20h
simple-job-runner                                    0/1     Completed   1          2m39s
References: