Backing up and restoring a Kubernetes cluster's data really comes down to backing up and restoring the etcd database cluster.
The whole etcd cluster is restored from a single snapshot.
How is Kubernetes data stored in etcd?
etcd is a critically important service in a Kubernetes cluster: it stores all of the cluster's data. Likewise, a disaster or loss of the etcd data determines whether the cluster can be recovered at all, so this part focuses on how to back up and restore that data.
Appetizer: etcd basics
Following on from the previous part: because the etcd service is secured with CA certificates, the etcdctl client must be given the certificate parameters, like this:
./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 <your command>
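To avoid repeating the certificate and endpoint flags on every call, etcdctl (v3) also reads them from ETCDCTL_-prefixed environment variables; a small convenience sketch using the same paths and endpoints as above:
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/ssl/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/ssl/kubernetes.pem
export ETCDCTL_KEY=/etc/kubernetes/ssl/kubernetes-key.pem
export ETCDCTL_ENDPOINTS=https://192.168.10.133:2379,https://192.168.10.134:2379
./etcdctl endpoint health
The examples below spell out the flags in full so that each command stands on its own.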
A few quick commands to look at the Kubernetes data stored inside:
Check the cluster health
[root@k8s-master etcd]# ./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 endpoint health
https://192.168.10.133:2379 is healthy: successfully committed proposal: took = 14.251562ms
https://192.168.10.134:2379 is healthy: successfully committed proposal: took = 14.420846ms
Get the value of a specific key
[root@k8s-master etcd]# ./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 get /registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.apps
{"kind":"APIService","apiVersion":"apiregistration.k8s.io/v1beta1","metadata":{"name":"v1.apps","uid":"93eb024b-fad5-11e9-a51a-000c29a4e4b2","creationTimestamp":"2019-10-30T05:24:42Z","labels":{"kube-aggregator.kubernetes.io/automanaged":"onstart"}},"spec":{"service":null,"group":"apps","version":"v1","groupPriorityMinimum":17800,"versionPriority":15},"status":{"conditions":[{"type":"Available","status":"True","lastTransitionTime":"2019-10-30T05:24:42Z","reason":"Local","message":"Local APIServices are always available"}]}}
Get the status of each endpoint
[root@k8s-master etcd]# ./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 endpoint status
https://192.168.10.133:2379, 5e881233406036eb, 3.4.3, 3.9 MB, false, false, 1913, 34872, 34872,
https://192.168.10.134:2379, f3796165363d755a, 3.4.3, 3.9 MB, true, false, 1913, 34872, 34872
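The bare output has no headers; the fields are endpoint, member ID, version, DB size, leader flag, learner flag, raft term, raft index, raft applied index and errors. With etcdctl 3.4 you can add -w table to get the same data with column headers:
./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 endpoint status -w table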
Get the etcd version
[root@k8s-master etcd]# ./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 version
etcdctl version: 3.4.3
API version: 3.4
List all keys
[root@k8s-master etcd]# ./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 get / --prefix --keys-only
/registry/apiregistration.k8s.io/apiservices/v1.
/registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/secrets/kube-system/node-controller-token-8pbtv
/registry/secrets/kube-system/persistent-volume-binder-token-tjhmn
/registry/secrets/kube-system/pod-garbage-collector-token-9rbvg
/registry/secrets/kube-system/pv-protection-controller-token-zzqkq
/registry/secrets/kube-system/pvc-protection-controller-token-b2vjh
/registry/secrets/kube-system/replicaset-controller-token-xzrrg
/registry/secrets/kube-system/replication-controller-token-7hzqr
/registry/secrets/kube-system/resourcequota-controller-token-jx6zn
/registry/serviceaccounts/kube-system/certificate-controller
/registry/serviceaccounts/kube-system/clusterrole-aggregation-controller
/registry/serviceaccounts/kube-system/coredns
/registry/serviceaccounts/kube-system/cronjob-controller
/registry/serviceaccounts/kube-system/daemon-set-controller
/registry/serviceaccounts/kube-system/default
/registry/serviceaccounts/kube-system/deployment-controller
/registry/serviceaccounts/kube-system/disruption-controller
/registry/serviceaccounts/kube-system/endpoint-controller
/registry/serviceaccounts/kube-system/expand-controller
/registry/serviceaccounts/kube-system/flannel
/registry/serviceaccounts/kube-system/generic-garbage-collector
/registry/serviceaccounts/kube-system/horizontal-pod-autoscaler
/registry/serviceaccounts/kube-system/job-controller
/registry/serviceaccounts/kube-system/kube-proxy
/registry/services/specs/default/kubernetes
/registry/services/specs/kube-system/kube-dns
As you can see, etcd stores essentially all Kubernetes-related data: Pod information, Service definitions, tokens, and the other resources. Since everything lives in there, when the cluster fails we can get back to the pre-failure state simply by restoring etcd from a backup.
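As a cross-check, the same object read raw from etcd above can also be fetched through the apiserver with kubectl (note that, depending on the apiserver's storage media type, many objects are stored in etcd as binary protobuf rather than readable JSON):
kubectl get apiservice v1.apps -o yaml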
Backup
Note that the etcdctl commands differ between versions, but they are broadly the same. Here we use snapshot save; taking the snapshot from a single node is enough, because the snapshot captures the full keyspace.
[root@k8s-master etcd]# ./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem snapshot save /var/lib/etcd_backup/backup_$(date "+%Y%m%d%H%M%S").db
{"level":"info","ts":1572440676.1362343,"caller":"snapshot/v3_snapshot.go:110","msg":"created temporary db file","path":"/var/lib/etcd_backup/backup_20191030210436.db.part"}
{"level":"warn","ts":"2019-10-30T21:04:36.144+0800","caller":"clientv3/retry_interceptor.go:116","msg":"retry stream intercept"}
{"level":"info","ts":1572440676.1444993,"caller":"snapshot/v3_snapshot.go:121","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":1572440676.3010879,"caller":"snapshot/v3_snapshot.go:134","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","took":0.164486462}
{"level":"info","ts":1572440676.3011765,"caller":"snapshot/v3_snapshot.go:143","msg":"saved","path":"/var/lib/etcd_backup/backup_20191030210436.db"}
Snapshot saved at /var/lib/etcd_backup/backup_20191030210436.db
[root@k8s-master etcd]# cd /var/lib/etcd_backup/
[root@k8s-master etcd_backup]# ls
backup_20191030210436.db
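For routine protection it is worth scheduling this backup; a minimal cron sketch, assuming etcdctl is on the PATH and that snapshots older than 7 days can be pruned (the script path is hypothetical; adjust paths and retention to your environment):
#!/bin/bash
# /usr/local/bin/etcd-backup.sh (hypothetical path): take an etcd snapshot and prune old ones
BACKUP_DIR=/var/lib/etcd_backup
mkdir -p $BACKUP_DIR
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/ssl/ca.pem \
  --cert=/etc/kubernetes/ssl/kubernetes.pem \
  --key=/etc/kubernetes/ssl/kubernetes-key.pem \
  snapshot save $BACKUP_DIR/backup_$(date "+%Y%m%d%H%M%S").db
# prune snapshots older than 7 days
find $BACKUP_DIR -name "backup_*.db" -mtime +7 -delete
A matching crontab entry could be: 0 2 * * * /usr/local/bin/etcd-backup.sh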
Restore
- Stop the kube-apiserver service and make sure it is no longer running (a loop for doing this on all masters is sketched after the check below)
# first, save the apiserver manifest somewhere else
mkdir -p /etc/kubernetes/manifests-backups
mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests-backups/
# check that the api-server has stopped
ps -ef|grep kube-api|grep -v grep |wc -l
0
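On a multi-master cluster the same step has to be done on every master; a loop sketch, assuming root ssh access and the master1..master3 hostnames used in the copy step below:
for m in master1 master2 master3; do
  ssh root@$m "mkdir -p /etc/kubernetes/manifests-backups; mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests-backups/; ps -ef | grep kube-api | grep -v grep | wc -l"
done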
- Stop the etcd service on each master node
service etcd stop
Move the existing data directory out of the way
mv /var/lib/etcd/data.etcd /var/lib/etcd/data.etcd_bak
- Restore: the whole etcd cluster is restored from the same snapshot
Restore the data on each node. First copy the snapshot to every master node; assume the backup file is /var/lib/etcd_backup/backup_20180107172459.db
scp /var/lib/etcd_backup/backup_20180107172459.db root@master1:/var/lib/etcd_backup/
scp /var/lib/etcd_backup/backup_20180107172459.db root@master2:/var/lib/etcd_backup/
scp /var/lib/etcd_backup/backup_20180107172459.db root@master3:/var/lib/etcd_backup/
Run the following on the etcd machines.
For example, on 192.168.10.133 (in fact this can be turned into a script, see below).
The restore data directory is named nodeName.etcd.restore
export ETCDCTL_API=3
etcdctl snapshot restore /var/lib/etcd_backup/backup_20191030210436.db \
  --data-dir="/var/lib/etcd/etcd-0.etcd.restore" \
  --name etcd-0 \
  --initial-cluster "etcd-0=https://192.168.10.133:2380,etcd-1=https://192.168.10.134:2380" \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-advertise-peer-urls https://192.168.10.133:2380
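On the second member, 192.168.10.134, the same snapshot is restored with that node's own name, data directory and peer URL; a sketch mirroring the command above:
export ETCDCTL_API=3
etcdctl snapshot restore /var/lib/etcd_backup/backup_20191030210436.db \
  --data-dir="/var/lib/etcd/etcd-1.etcd.restore" \
  --name etcd-1 \
  --initial-cluster "etcd-0=https://192.168.10.133:2380,etcd-1=https://192.168.10.134:2380" \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-advertise-peer-urls https://192.168.10.134:2380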
Script:
set -x
# pull the member name and cluster parameters out of the existing etcd systemd unit,
# so the restored member is created with exactly the same identity
export ETCD_NAME=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "name.*-name-[0-9].*--client"|awk '{print $2}')
export ETCD_CLUSTER=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "initial-cluster.*--initial"|awk '{print $2}')
export ETCD_INITIAL_CLUSTER_TOKEN=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "initial-cluster-token.*"|awk '{print $2}')
export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "initial-advertise-peer-urls.*--listen-peer"|awk '{print $2}')
# restore the snapshot into the data directory that etcd.service expects
ETCDCTL_API=3 etcdctl snapshot --cacert=/var/lib/etcd/cert/ca.pem --cert=/var/lib/etcd/cert/etcd-client.pem --key=/var/lib/etcd/cert/etcd-client-key.pem restore /var/lib/etcd_backup/backup_20180107172459.db \
  --name $ETCD_NAME \
  --data-dir /var/lib/etcd/data.etcd \
  --initial-cluster $ETCD_CLUSTER \
  --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
  --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS
# hand the restored data back to the etcd user
chown -R etcd:etcd /var/lib/etcd/data.etcd
Start etcd on each node, and confirm with the service command that it started successfully
# service etcd start
# service etcd status
Check the cluster health
[root@k8s-master etcd]# ./etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://192.168.10.133:2379,https://192.168.10.134:2379 endpoint health
https://192.168.10.133:2379 is healthy: successfully committed proposal: took = 14.251562ms
https://192.168.10.134:2379 is healthy: successfully committed proposal: took = 14.420846ms
If etcd is healthy, go to each master and restore kube-apiserver
# mv /etc/kubernetes/manifests-backups/kube-apiserver.yaml /etc/kubernetes/manifests/
Check whether the cluster's api-server is back to normal
# kubectl get cs
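On newer Kubernetes versions (1.19+) kubectl get cs is deprecated; checking the nodes and the kube-system pods gives the same confidence:
kubectl get nodes
kubectl get pods -n kube-system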
Summary
Kubernetes backups essentially come down to backing up etcd. When restoring, what matters most is the order of operations: stop kube-apiserver, stop etcd, restore the data, start etcd, then start kube-apiserver.