dashboard
Monitored items:
- Pods that are not in the Running state across all k8s clusters, in particular pods in CrashLoopBackOff and pods stuck in ContainerCreating; Grafana can alert on these
- Pod restart counts
TechStack
Prometheus + Grafana
PromQL
Add Prometheus as the Grafana datasource. Each PromQL query below corresponds to the panel title of the same name in the dashboard above.
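To check a panel query outside Grafana, the dashboard variables can be substituted by hand and the result sent to the Prometheus HTTP API. The sketch below is only an illustration: the address `http://prometheus.example:9090` is hypothetical, and `render_query` and `instant_query_url` are made-up helper names, not part of Grafana or Prometheus. It builds the request URL for the `/api/v1/query` endpoint and makes no network call.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with the real datasource URL.
PROMETHEUS_URL = "http://prometheus.example:9090"


def render_query(template: str, variables: dict) -> str:
    """Replace Grafana-style $name placeholders with concrete values."""
    rendered = template
    # Substitute longer names first so "$pod_ip" is not clobbered by "$pod".
    for name in sorted(variables, key=len, reverse=True):
        rendered = rendered.replace(f"${name}", variables[name])
    return rendered


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


template = (
    'sum(kube_pod_container_status_waiting_reason'
    '{namespace=~"$namespace",pod=~"$pod"}) by (reason,namespace,pod) > 0'
)
# ".*" stands in for selecting every pod (an assumption about the variable setup).
promql = render_query(template, {"namespace": "default", "pod": ".*"})
url = instant_query_url(promql)
print(url)
```

The resulting URL can be opened with any HTTP client. The response format is the standard Prometheus JSON envelope and is not parsed here.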
- Container Waiting Reason
(sum(kube_pod_container_status_waiting_reason{reason!="ContainerCreating",namespace=~"$namespace",pod=~"$pod"} ) by (reason,namespace,pod) >0)
*on(pod) group_right(reason) sum(kube_pod_info) by (pod,node,host_ip,pod_ip,namespace)
or
(sum(kube_pod_container_status_waiting_reason{reason="ContainerCreating",namespace=~"$namespace",pod=~"$pod"} ) by (reason,namespace,pod) >0)
-on(pod) group_right(reason) sum(kube_pod_info) by (pod,node,host_ip,pod_ip,namespace)
- Pod restart count (Last 5m)
(sum(kube_pod_container_status_restarts_total{namespace=~"$namespace",pod=~"$pod"}) by(namespace,pod) *on(pod) group_right() sum(kube_pod_info) by (pod,node,host_ip,pod_ip,namespace)
-sum(kube_pod_container_status_restarts_total{namespace=~"$namespace",pod=~"$pod"} offset 5m) by(namespace,pod) *on(pod) group_right() sum(kube_pod_info) by (pod,node,host_ip,pod_ip,namespace))
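The restart panel subtracts each pod's counter value from five minutes earlier (`offset 5m`) from its current value. The same arithmetic can be sketched in Python over two snapshots of the counter. The function name `restarts_in_window` and the sample data are hypothetical. Skipping pods that appear in only one snapshot is an assumption meant to mirror how PromQL arithmetic drops series with no match on the other side.

```python
def restarts_in_window(now: dict, before: dict) -> dict:
    """Per-pod restart delta between two counter snapshots.

    Keys are (namespace, pod) tuples; values are cumulative restart counts.
    Pods present in only one snapshot are skipped.
    """
    return {
        key: now[key] - before[key]
        for key in now
        if key in before
    }


# Made-up sample data for illustration only.
now = {("default", "web-0"): 7, ("default", "web-1"): 2, ("kube-system", "dns-0"): 1}
before = {("default", "web-0"): 4, ("default", "web-1"): 2}

deltas = restarts_in_window(now, before)
# Keep only pods that actually restarted inside the window.
restarted = {key: n for key, n in deltas.items() if n > 0}
print(restarted)
```

A plain subtraction like this goes negative if the counter resets inside the window. PromQL's `increase()` over a `[5m]` range is a common alternative that accounts for resets, though it is not what the query above uses.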