Liveness Probe, Readiness Probe not called in expected duration
Question
On GKE, I tried to use a readiness probe and a liveness probe, and to post alerts using Cloud Monitoring: https://cloud.google.com/monitoring/alerts/using-alerting-ui
As a test, I created a Pod with a readiness probe and a liveness probe. As I expected, every probe check failed.
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 0
      periodSeconds: 10
      timeoutSeconds: 10
      successThreshold: 1
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 20
      periodSeconds: 60
      timeoutSeconds: 30
      successThreshold: 1
      failureThreshold: 3
Checking the GCP logs, both error logs initially showed up according to their periodSeconds.
Readiness probe: every 10 seconds
2021-02-21 13:26:30.000 JST Readiness probe failed: HTTP probe failed with statuscode: 500
2021-02-21 13:26:40.000 JST Readiness probe failed: HTTP probe failed with statuscode: 500
Liveness probe: every 1 minute
2021-02-21 13:25:40.000 JST Liveness probe failed: HTTP probe failed with statuscode: 500
2021-02-21 13:26:40.000 JST Liveness probe failed: HTTP probe failed with statuscode: 500
But after running this Pod for several minutes:
- liveness probe checks were no longer performed
- readiness probe checks were still called, but the interval became longer (the maximum interval looked to be about 10 minutes)
$ kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
30m Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
25m Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
20m Warning BackOff pod/liveness-http Back-off restarting failed container
20m Normal Scheduled pod/liveness-http Successfully assigned default/liveness-http to gke-cluster-default-pool-8bc9c75c-rfgc
17m Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
17m Normal Pulled pod/liveness-http Successfully pulled image "k8s.gcr.io/liveness"
17m Normal Created pod/liveness-http Created container liveness
20m Normal Started pod/liveness-http Started container liveness
4m59s Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
17m Warning Unhealthy pod/liveness-http Liveness probe failed: HTTP probe failed with statuscode: 500
17m Normal Killing pod/liveness-http Container liveness failed liveness probe, will be restarted
In my plan, I would create an alert policy whose condition is like:
- if a liveness probe error happens 3 times within 3 minutes
But the probe checks were not called as I expected, so this policy didn't work; the alert became resolved even though the Pod was not running.
Why didn't the liveness probe run, and why did the readiness probe interval change?
Note: if there is another good alert policy for checking Pod liveness, I don't care about this behavior. I would appreciate it if someone could advise me on what kind of alert policy is ideal for checking Pods.
Answer
Background
The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.
The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
As the GKE master is managed by Google, you won't find kubelet logs using the CLI (you might try to use Stackdriver). I've tested this on a Kubeadm cluster with the kubelet verbosity level set to 8.
When you use $ kubectl get events you only get events from the last hour (this can be changed in Kubernetes settings on Kubeadm, but I don't think it can be changed in GKE, as the master is managed by Google).
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
37m Normal Starting node/kubeadm Starting kubelet.
...
33m Normal Scheduled pod/liveness-http Successfully assigned default/liveness-http to kubeadm
33m Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
33m Normal Pulled pod/liveness-http Successfully pulled image "k8s.gcr.io/liveness" in 893.953679ms
33m Normal Created pod/liveness-http Created container liveness
33m Normal Started pod/liveness-http Started container liveness
3m12s Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
30m Warning Unhealthy pod/liveness-http Liveness probe failed: HTTP probe failed with statuscode: 500
8m17s Warning BackOff pod/liveness-http Back-off restarting failed container
The same command again after ~1 hour:
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
33s Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
5m40s Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
15m Warning BackOff pod/liveness-http Back-off restarting failed container
Tests

The readiness probe check was executed every 10 seconds for more than one hour.
Mar 09 14:48:34 kubeadm kubelet[3855]: I0309 14:48:34.222085 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 14:48:44 kubeadm kubelet[3855]: I0309 14:48:44.221782 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 14:48:54 kubeadm kubelet[3855]: I0309 14:48:54.221828 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 15:01:34 kubeadm kubelet[3855]: I0309 15:01:34.222491 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:01:44 kubeadm kubelet[3855]: I0309 15:01:44.221877 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:01:54 kubeadm kubelet[3855]: I0309 15:01:54.221976 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 15:10:14 kubeadm kubelet[3855]: I0309 15:10:14.222163 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:10:24 kubeadm kubelet[3855]: I0309 15:10:24.221744 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:10:34 kubeadm kubelet[3855]: I0309 15:10:34.223877 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 16:04:14 kubeadm kubelet[3855]: I0309 16:04:14.222853 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:04:24 kubeadm kubelet[3855]: I0309 16:04:24.222531 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
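A quick sketch (plain Python, illustrative only) of how the 10-second interval can be confirmed directly from kubelet log timestamps like the ones above: parse the times and diff consecutive entries.

```python
# Confirm the readiness probe interval from kubelet log timestamps.
# The sample timestamps are taken from the log excerpt above.
from datetime import datetime

stamps = ["14:48:34", "14:48:44", "14:48:54"]
times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]

# Differences between consecutive checks, in seconds.
intervals = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(intervals)  # [10.0, 10.0] -> readiness probe runs every 10 seconds
```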
There are also liveness probe entries:
Mar 09 16:12:58 kubeadm kubelet[3855]: I0309 16:12:58.462878 3855 prober.go:117] Liveness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:13:58 kubeadm kubelet[3855]: I0309 16:13:58.462906 3855 prober.go:117] Liveness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465470 3855 kuberuntime_manager.go:656] Container "liveness" ({"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) of pod liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a): Container liveness failed liveness probe, will be restarted
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465587 3855 kuberuntime_manager.go:712] Killing unwanted container "liveness"(id={"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) for pod "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a)"
Total test time:
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
liveness-http 0/1 Running 21 99m
liveness-http 0/1 CrashLoopBackOff 21 101m
liveness-http 0/1 Running 22 106m
liveness-http 1/1 Running 22 106m
liveness-http 0/1 Running 22 106m
liveness-http 0/1 Running 23 109m
liveness-http 1/1 Running 23 109m
liveness-http 0/1 Running 23 109m
liveness-http 0/1 CrashLoopBackOff 23 112m
liveness-http 0/1 Running 24 117m
liveness-http 1/1 Running 24 117m
liveness-http 0/1 Running 24 117m
Conclusion
Liveness probe checks were not called anymore
The liveness check is created when Kubernetes creates the Pod, and is recreated each time the Pod is restarted. In your configuration you have set initialDelaySeconds: 20, so after creating the Pod, Kubernetes will wait 20 seconds and then call the liveness probe 3 times (as you have set failureThreshold: 3). After 3 failures, Kubernetes will restart the Pod according to its RestartPolicy. You will also find this in the logs:
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465470 3855 kuberuntime_manager.go:656] Container "liveness" ({"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) of pod liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a): Container liveness failed liveness probe, will be restarted
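To make the timing concrete, here is a minimal sketch of when the kubelet fires the liveness probe and triggers the first restart under these settings, assuming every check fails immediately (as in the test pod) and ignoring timeoutSeconds and probe jitter:

```python
# Illustrative timeline of liveness probe checks for one container start.
# Approximation only: real probe times also include jitter and timeouts.
initial_delay = 20      # initialDelaySeconds from the pod spec
period = 60             # periodSeconds
failure_threshold = 3   # failureThreshold

# Approximate times (seconds after container start) of each probe check.
check_times = [initial_delay + i * period for i in range(failure_threshold)]

# The restart is triggered on the 3rd consecutive failure.
restart_time = check_times[-1]

print(check_times)   # [20, 80, 140]
print(restart_time)  # 140 -> the container is killed ~2-3 minutes after start
```

This matches the log excerpt above, where the liveness failures arrive one minute apart and the "will be restarted" entry follows the third failure.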
Why will it be restarted? The answer can be found in Container probes:
livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy.
The default restart policy in GKE is Always, so your Pod will be restarted over and over again.
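The growing gaps between restarts in the $ kubectl get po -w output come from CrashLoopBackOff. A rough sketch of the kubelet's back-off behavior (an approximation: the delay starts at 10s, doubles on each crash, and is capped at 5 minutes; the real kubelet also resets the back-off after the container runs cleanly for 10 minutes):

```python
# Approximate CrashLoopBackOff delays between successive container restarts.
def backoff_delays(restarts, base=10, cap=300):
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))  # delay never exceeds the 5-minute cap
        delay *= 2                      # exponential back-off
    return delays

print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why, after enough restarts, the Pod spends most of its time waiting in CrashLoopBackOff rather than running and being probed.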
Readiness probe checks were called, but the interval became long (the maximum interval looked to be about 10 minutes)
I think you came to that conclusion because you based it on $ kubectl get events and $ kubectl describe po. In both cases, events are removed after 1 hour by default. In my Tests section you can see that the Readiness probe entries run from 14:48:34 till 16:04:24, so Kubernetes called the readiness probe every 10 seconds.
Why didn't the liveness probe run, and why did the readiness probe interval change?
As I showed in the Tests section, the readiness probe interval didn't change. What was misleading in this case was using $ kubectl get events. As for the liveness probe, it is still being called, but only 3 times each time the Pod is created/restarted. I've also included the output of $ kubectl get po -w; when the Pod is recreated, you can find those liveness probes in the kubelet logs.
In my plan, I would create an alert policy whose condition is like:
- if a liveness probe error happens 3 times within 3 minutes
If the liveness probe fails 3 times, with your current setup it will restart this Pod. In that situation you could use each restart to create an alert.
Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
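A sketch of what such an alert policy could look like, built as a plain dict mirroring the Cloud Monitoring AlertPolicy JSON schema (the field names and the gcloud command are assumptions based on that schema, not taken from the question; adjust the window and threshold to taste). The resulting file could be created with something like `gcloud alpha monitoring policies create --policy-from-file=policy.json`:

```python
# Hypothetical alert policy: fire whenever restart_count increases,
# aggregated over a 3-minute window. Field names follow the Cloud
# Monitoring AlertPolicy JSON schema (an assumption, verify before use).
import json

policy = {
    "displayName": "Container restart alert",
    "combiner": "OR",
    "conditions": [{
        "displayName": "restart_count increases",
        "conditionThreshold": {
            "filter": 'metric.type = "kubernetes.io/container/restart_count" '
                      'AND resource.type = "k8s_container"',
            "aggregations": [{
                "alignmentPeriod": "180s",        # 3-minute window
                "perSeriesAligner": "ALIGN_DELTA" # change in the counter
            }],
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0,
            "duration": "0s",
        },
    }],
}

print(json.dumps(policy, indent=2))
```

Since the Pod is restarted after every 3 consecutive liveness failures, alerting on restart_count effectively captures the "liveness probe failed 3 times" condition without depending on probe log entries.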
You can find some useful information regarding Monitoring alerts in Stack Overflow cases like: