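What to do when pods show CrashLoopBackOff after deploying Calico?
Address: //www.greatytc.com/p/87a01ec9964c
Environment preparation
Before starting, we need a clean Kubernetes cluster. "Clean" here means a cluster that no network plugin has touched yet. I prepared the following three nodes: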
IP          Role          OS
10.0.1.111  Master, Node  Ubuntu 18.04
10.0.1.112  Master, Node  Ubuntu 18.04
10.0.1.113  Node          Ubuntu 18.04
All nodes run Ubuntu 18.04. I use an Ansible role here to quickly build a clean Kubernetes cluster from binaries:
Deploying binary k8s with Ansible
Address: //www.greatytc.com/p/85edca636ddc
Edit the host inventory hosts.yaml so that it reads as follows:
all:
  vars:
    ansible_user: root
    ansible_ssh_pass: root1234
    ansible_sudo_pass: root1234
    is_mutil_master: yes
    virtual_ip: 10.0.1.110
    virtual_ip_device: ens33
    service_net: 10.0.0.0/24
    pod_net: 10.244.0.0/16
    proxy_master_port: 7443
    install_dir: /opt/apps/
    package_dir: /opt/packages/
    tls_dir: /opt/k8s_tls
    ntp_host: ntp1.aliyun.com
    have_network: yes
    replace_repo: yes
    docker_registry_mirrors: https://7hsct51i.mirror.aliyuncs.com
    kubelet_bootstrap_token: 8fba966b6e3b5d182960a30f6cb94428
    pause_image: registry.cn-shenzhen.aliyuncs.com/zze/pause:3.2
    dashboard_port: 30001
    dashboard_token_file: dashboard_token.txt
    ingress_controller_type: nginx
  hosts:
    10.0.1.111:
      hostname: k8s-master1
      master: yes
      node: yes
      etcd: yes
      proxy_master: yes
      proxy_priority: 110
    10.0.1.112:
      hostname: k8s-master2
      master: yes
      node: yes
      etcd: yes
      proxy_master: yes
      proxy_priority: 100
    10.0.1.113:
      hostname: k8s-node1
      etcd: yes
      node: yes
      ingress: yes
This configuration builds a Kubernetes cluster on three machines with 2 masters and 3 nodes (both masters also act as nodes). Run the playbook:
$ ansible-playbook -i hosts.yml run.yml --skip-tags=deploy_manifests
...
TASK [start_service : 签发 Kubelet 申请的证书 - 签发证书 (2/2)] ***************************************************************************************************************************************************************************
skipping: [10.0.1.112]
skipping: [10.0.1.113]
changed: [10.0.1.111]
PLAY RECAP *********************************************************************************************************************************************************************************************************************
10.0.1.111 : ok=88 changed=46 unreachable=0 failed=0 skipped=20 rescued=0 ignored=0
10.0.1.112 : ok=68 changed=32 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
10.0.1.113 : ok=48 changed=20 unreachable=0 failed=0 skipped=35 rescued=0 ignored=0
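A quick way to confirm that the run above finished cleanly is to scan the PLAY RECAP for non-zero failed or unreachable counters. The following is a minimal sketch; recap_failures is a hypothetical helper name, not part of Ansible, and it assumes the playbook output was saved to a file.

```shell
#!/usr/bin/env bash
# Sketch: list hosts whose PLAY RECAP line reports a problem.
# recap_failures is a hypothetical helper, not an Ansible command.
recap_failures() {
  # Reads ansible-playbook output on stdin and prints every host whose
  # recap line has a non-zero "unreachable" or "failed" counter.
  awk '
    /unreachable=[0-9]+/ && /failed=[0-9]+/ {
      bad = 0
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^(unreachable|failed)=/) {
          split($i, kv, "=")
          if (kv[2] + 0 > 0) bad = 1
        }
      }
      if (bad) print $1
    }
  '
}
```

For example, with the output saved to a hypothetical deploy.log, running "recap_failures < deploy.log" prints nothing when every host succeeded.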
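By default, this Ansible role also installs add-ons such as the network plugin, CoreDNS, and Dashboard after deploying the basic Kubernetes components. The --skip-tags=deploy_manifests option skips those steps here.
Because this Ansible role uses Flannel as its CNI plugin by default, some Flannel binaries are preinstalled. Delete them on every node: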
$ rm -f /opt/apps/cni/bin/*
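Since the cleanup has to run on every node, it can be scripted from one machine. This is a minimal sketch that assumes SSH access as root to the three node IPs from hosts.yaml; cni_cleanup_commands is a hypothetical helper name. It only prints the commands so they can be reviewed before anything is deleted.

```shell
#!/usr/bin/env bash
# Node IPs and the root user are taken from hosts.yaml above.
NODES="10.0.1.111 10.0.1.112 10.0.1.113"
# install_dir from hosts.yaml plus cni/bin
CNI_BIN_DIR="/opt/apps/cni/bin"

# Hypothetical helper: print one ssh command per node instead of running it.
cni_cleanup_commands() {
  for node in $NODES; do
    # The glob is quoted so that it expands on the remote node, not locally.
    printf 'ssh root@%s "rm -f %s/*"\n' "$node" "$CNI_BIN_DIR"
  done
}
```

Run "cni_cleanup_commands" to review the commands, then "cni_cleanup_commands | sh" to execute them.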
At this point a clean Kubernetes cluster is ready. Its nodes look like this:
$ kubectl get node
NAME STATUS ROLES AGE VERSION
k8s-master1 NotReady <none> 38m v1.19.3
k8s-master2 NotReady <none> 38m v1.19.3
k8s-node1 NotReady <none> 38m v1.19.3
The nodes are NotReady because no network plugin is installed yet. They will become Ready once the Calico deployment steps below are done.
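To track which nodes are still waiting for a network plugin, the node list can be filtered by its STATUS column. A minimal sketch follows; not_ready_nodes is a hypothetical helper name, and any status other than exactly "Ready" (for example "Ready,SchedulingDisabled") is reported as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print nodes whose STATUS is not exactly "Ready".
not_ready_nodes() {
  # Reads `kubectl get node` output on stdin and skips the header row.
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Usage: "kubectl get node | not_ready_nodes". After Calico is deployed, "kubectl wait --for=condition=Ready node --all --timeout=300s" blocks until every node reports Ready or the timeout expires.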
Deploying Calico
Download the manifest from the official site:
$ wget https://docs.projectcalico.org/manifests/calico-etcd.yaml
Only the modified parts are listed below:
$ vim calico-etcd.yaml
...
  # Anything wrapped in backticks is a command: run it and paste its output in place
  # etcd certificate private key
  etcd-key: `cat /opt/k8s_tls/etcd/server-key.pem | base64 -w 0`
  # etcd certificate
  etcd-cert: `cat /opt/k8s_tls/etcd/server.pem | base64 -w 0`
  # etcd CA certificate
  etcd-ca: `cat /opt/k8s_tls/etcd/ca.pem | base64 -w 0`
...
  # etcd cluster endpoints
  etcd_endpoints: "https://10.0.1.111:2379,https://10.0.1.112:2379,https://10.0.1.113:2379"
  etcd_ca: "/calico-secrets/etcd-ca"
  etcd_cert: "/calico-secrets/etcd-cert"
  etcd_key: "/calico-secrets/etcd-key"
...
            # Disable IPIP mode
            - name: CALICO_IPV4POOL_IPIP
              value: "Never"
            # Pod IP range; the value must match the pod_net variable in hosts.yaml
            - name: CALICO_IPV4POOL_CIDR
              value: "10.244.0.0/16"
...
        # Host directory for the CNI plugin binaries; /opt/apps matches install_dir in hosts.yaml
        - name: cni-bin-dir
          hostPath:
            path: /opt/apps/cni/bin
        # Host directory for the CNI configuration; /opt/apps matches install_dir in hosts.yaml
        - name: cni-net-dir
          hostPath:
            path: /opt/apps/cni/conf
        # Host directory for the CNI logs; /opt/apps matches install_dir in hosts.yaml
        - name: cni-log-dir
          hostPath:
            path: /opt/apps/cni/log
        # Set the mount mode of this volume to 0440; it appears in two places
        - name: etcd-certs
          secret:
            secretName: calico-etcd-secrets
            defaultMode: 0440
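The three backtick expressions in the Secret can be evaluated in one go and pasted into the manifest. This is a minimal sketch assuming GNU coreutils base64 (for the -w 0 flag) and the certificate paths shown above; b64_file and print_etcd_secret_data are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Directory holding the etcd certificates (tls_dir from hosts.yaml plus /etcd).
TLS_DIR="${TLS_DIR:-/opt/k8s_tls/etcd}"

# Hypothetical helper: encode a file as single-line base64.
b64_file() {
  base64 -w 0 < "$1"
}

# Hypothetical helper: print the three data entries of calico-etcd-secrets.
print_etcd_secret_data() {
  printf '  etcd-key: %s\n'  "$(b64_file "$TLS_DIR/server-key.pem")"
  printf '  etcd-cert: %s\n' "$(b64_file "$TLS_DIR/server.pem")"
  printf '  etcd-ca: %s\n'   "$(b64_file "$TLS_DIR/ca.pem")"
}
```

Running "print_etcd_secret_data" on a node that holds the certificates prints three lines ready to paste under the Secret's data key.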
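The images referenced by this manifest are hosted overseas, so I pulled them and pushed them to an Alibaba Cloud registry. Run the following to switch the manifest to those copies: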
$ sed -i 's#docker.io/calico/cni:v3.18.0#registry.cn-shenzhen.aliyuncs.com/zze/calico-cni:v3.18.0#g;s#docker.io/calico/pod2daemon-flexvol:v3.18.0#registry.cn-shenzhen.aliyuncs.com/zze/calico-pod2daemon-flexvol:v3.18.0#g;s#docker.io/calico/node:v3.18.0#registry.cn-shenzhen.aliyuncs.com/zze/calico-node:v3.18.0#g;s#docker.io/calico/kube-controllers:v3.18.0#registry.cn-shenzhen.aliyuncs.com/zze/calico-kube-controllers:v3.18.0#g' calico-etcd.yaml
Note: the manifest downloaded at the time of writing uses image version v3.18.0.
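The four substitutions above all follow the same pattern, docker.io/calico/<name>:<tag> becoming registry.cn-shenzhen.aliyuncs.com/zze/calico-<name>:<tag>, so they can be expressed as a single rule. This is a minimal sketch; rewrite_calico_images is a hypothetical helper name. Unlike the command above, it rewrites any tag, while the mirror is only known to hold the v3.18.0 images.

```shell
#!/usr/bin/env bash
# Mirror registry and namespace used in the sed command above.
MIRROR="registry.cn-shenzhen.aliyuncs.com/zze"

# Hypothetical helper: rewrite Calico image references read from stdin.
rewrite_calico_images() {
  # docker.io/calico/<name>:<tag>  ->  $MIRROR/calico-<name>:<tag>
  sed "s#docker\.io/calico/\([a-z0-9-]*\):#${MIRROR}/calico-\1:#g"
}
```

Usage: "rewrite_calico_images < calico-etcd.yaml > calico-etcd.mirror.yaml", then check the result with "grep image: calico-etcd.mirror.yaml". The output file name is only an example.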
Apply the modified manifest:
$ kubectl apply -f calico-etcd.yaml
secret/calico-etcd-secrets created
configmap/calico-config created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
poddisruptionbudget.policy/calico-kube-controllers created
After a short wait, the following Pods start in the kube-system namespace:
$ kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-79678fdb96-5w4kl 1/1 Running 0 16m 10.0.1.111 k8s-master1 <none> <none>
calico-node-hsm8s 1/1 Running 0 16m 10.0.1.112 k8s-master2 <none> <none>
calico-node-qnm9r 1/1 Running 0 16m 10.0.1.113 k8s-node1 <none> <none>
calico-node-t4cjq 1/1 Running 0 16m 10.0.1.111 k8s-master1 <none> <none>
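If a Pod in this list shows a status such as CrashLoopBackOff instead of Running, a small filter helps to spot it. A minimal sketch follows; unhealthy_pods is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "name status" for pods that are neither
# Running nor Completed.
unhealthy_pods() {
  # Reads `kubectl get pod` output on stdin and skips the header row.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Usage: "kubectl get pod -n kube-system | unhealthy_pods". For each Pod it reports, "kubectl -n kube-system logs <pod> --previous" shows the output of the crashed container and "kubectl -n kube-system describe pod <pod>" shows its events. Wrong etcd endpoints or certificates in the manifest are worth checking first.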
To test cross-host Pod communication, apply the following manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test
  strategy: {}
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - image: busybox:latest
        command: ['sleep','3000']
        name: busybox
Once it has been applied, the following three Pods are created:
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-c4f594994-nl2ks 1/1 Running 0 2m49s 10.244.0.5 k8s-node1 <none> <none>
test-c4f594994-s48pl 1/1 Running 0 2m49s 10.244.2.2 k8s-master1 <none> <none>
test-c4f594994-wv6nj 1/1 Running 0 2m49s 10.244.1.2 k8s-master2 <none> <none>
The routes managed by Calico can also be inspected on each Node.
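With IPIP disabled, Calico distributes Pod routes over BGP, and its BIRD daemon normally installs them in the kernel routing table tagged "proto bird". The sketch below filters a routing table for those entries; bird_routes is a hypothetical helper name, and the "proto bird" tag is an assumption about this setup rather than something shown in the original output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "destination next-hop" for routes installed by BIRD.
bird_routes() {
  # Reads `ip route` output on stdin.
  awk '/ proto bird/ {
    # Local address blocks appear as "blackhole <cidr> proto bird".
    dest = ($1 == "blackhole") ? $2 : $1
    via = "local"
    for (i = 1; i < NF; i++) if ($i == "via") via = $(i + 1)
    print dest, via
  }'
}
```

Usage on a Node: "ip route | bird_routes". The shorter "ip route show proto bird" prints the same routes unformatted.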
Enter any one of the Pods and ping the other two:
$ kubectl exec -it test-c4f594994-nl2ks -- sh
/ # ping 10.244.2.2
PING 10.244.2.2 (10.244.2.2): 56 data bytes
64 bytes from 10.244.2.2: seq=0 ttl=62 time=0.474 ms
^C
--- 10.244.2.2 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.474/0.474/0.474 ms
/ # ping 10.244.1.2
PING 10.244.1.2 (10.244.1.2): 56 data bytes
64 bytes from 10.244.1.2: seq=0 ttl=62 time=0.321 ms
^C
--- 10.244.1.2 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.321/0.321/0.321 ms
The Pods can reach each other, which shows that Calico is working properly in the Kubernetes cluster.
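The manual ping test can be repeated for every test Pod in one pass. This is a minimal sketch that assumes the app=test label from the Deployment above and the busybox ping applet; ping_all_test_pods and the KUBECTL override variable are hypothetical names.

```shell
#!/usr/bin/env bash
# Command used to talk to the cluster; overridable for dry runs.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: ping every app=test Pod IP from inside one Pod.
# $1: name of the Pod to ping from
ping_all_test_pods() {
  src="$1"
  # Space-separated list of Pod IPs for the test Deployment.
  ips=$($KUBECTL get pod -l app=test -o jsonpath='{.items[*].status.podIP}')
  failed=0
  for ip in $ips; do
    # One probe per target with a 2 second timeout.
    if $KUBECTL exec "$src" -- ping -c 1 -W 2 "$ip" > /dev/null 2>&1; then
      echo "ok   $ip"
    else
      echo "FAIL $ip"
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: "ping_all_test_pods test-c4f594994-nl2ks". The function returns non-zero if any Pod IP is unreachable.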
Official reference: https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises
After finishing this article, I extended the Ansible role used above to support Calico. If you want to use Calico in a Kubernetes cluster, you can deploy it in one step with my Ansible role. Address: //www.greatytc.com/p/85edca636ddc