Liveness Probe, Readiness Probe not called in expected duration


Question

On GKE, I tried to use a readiness probe and a liveness probe, and to post alerts using Cloud Monitoring (https://cloud.google.com/monitoring/alerts/using-alerting-ui).

As a test, I created a Pod that has a readiness probe and a liveness probe. The probe checks fail every time, as I expected.

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 0
      periodSeconds: 10      
      timeoutSeconds: 10
      successThreshold: 1
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 20
      periodSeconds: 60
      timeoutSeconds: 30      
      successThreshold: 1
      failureThreshold: 3 

Checking the GCP logs, both error logs initially showed up at the configured periodSeconds intervals.

Readiness probe: every 10 seconds

2021-02-21 13:26:30.000 JST Readiness probe failed: HTTP probe failed with statuscode: 500

2021-02-21 13:26:40.000 JST Readiness probe failed: HTTP probe failed with statuscode: 500

Liveness probe: every 1 minute

2021-02-21 13:25:40.000 JST Liveness probe failed: HTTP probe failed with statuscode: 500

2021-02-21 13:26:40.000 JST Liveness probe failed: HTTP probe failed with statuscode: 500

But after running this pod for several minutes:

  • Liveness probe checks were not performed anymore
  • Readiness probe checks were still called, but the interval became longer (the maximum interval looks to be about 10 minutes)
$ kubectl get event
LAST SEEN   TYPE      REASON      OBJECT              MESSAGE
30m         Normal    Pulling     pod/liveness-http   Pulling image "k8s.gcr.io/liveness"
25m         Warning   Unhealthy   pod/liveness-http   Readiness probe failed: HTTP probe failed with statuscode: 500
20m         Warning   BackOff     pod/liveness-http   Back-off restarting failed container
20m         Normal    Scheduled   pod/liveness-http   Successfully assigned default/liveness-http to gke-cluster-default-pool-8bc9c75c-rfgc
17m         Normal    Pulling     pod/liveness-http   Pulling image "k8s.gcr.io/liveness"
17m         Normal    Pulled      pod/liveness-http   Successfully pulled image "k8s.gcr.io/liveness"
17m         Normal    Created     pod/liveness-http   Created container liveness
20m         Normal    Started     pod/liveness-http   Started container liveness
4m59s       Warning   Unhealthy   pod/liveness-http   Readiness probe failed: HTTP probe failed with statuscode: 500
17m         Warning   Unhealthy   pod/liveness-http   Liveness probe failed: HTTP probe failed with statuscode: 500
17m         Normal    Killing     pod/liveness-http   Container liveness failed liveness probe, will be restarted


In my plan, I would create an alert policy whose condition is something like:

  • liveness probe errors happen 3 times in 3 minutes

But if the probe checks are not called as I expect, these policies don't work; the alert is marked as resolved even though the pod is not running.

Why didn't the Liveness probe run, and why did the interval of the Readiness probe change?

Note: if there is another good alert policy for checking pod liveness, I don't care about this behavior. I would appreciate it if someone could advise what kind of alert policy is ideal for checking pods.

Answer

Background

In the Configure Liveness, Readiness and Startup Probes documentation you can find the following information:

The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.

The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.

As the GKE master is managed by Google, you won't find the kubelet logs using the CLI (you might try to use Stackdriver). I've tested this on a Kubeadm cluster with the kubelet verbosity level set to 8.
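For reference, a minimal sketch of how the kubelet verbosity can be raised on a Kubeadm node; this assumes a Debian-family install where kubeadm's systemd drop-in reads /etc/default/kubelet (the path differs on RPM-based systems):

# /etc/default/kubelet  (assumed path on a Debian-family kubeadm node)
# Extra flags appended to the kubelet; --v=8 makes individual probe results show up in its logs
KUBELET_EXTRA_ARGS="--v=8"

After editing it, restart the kubelet (systemctl restart kubelet) and follow the logs with journalctl -u kubelet -f; that is where the prober.go entries shown below come from.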

When you use $ kubectl get events you only get events from the last hour (this can be changed in the Kubernetes settings of a Kubeadm cluster, but I don't think it can be changed in GKE, as the master is managed by Google).

$ kubectl get events
LAST SEEN   TYPE      REASON                    OBJECT              MESSAGE
37m         Normal    Starting                  node/kubeadm        Starting kubelet.
...
33m         Normal    Scheduled                 pod/liveness-http   Successfully assigned default/liveness-http to kubeadm
33m         Normal    Pulling                   pod/liveness-http   Pulling image "k8s.gcr.io/liveness"
33m         Normal    Pulled                    pod/liveness-http   Successfully pulled image "k8s.gcr.io/liveness" in 893.953679ms
33m         Normal    Created                   pod/liveness-http   Created container liveness
33m         Normal    Started                   pod/liveness-http   Started container liveness
3m12s       Warning   Unhealthy                 pod/liveness-http   Readiness probe failed: HTTP probe failed with statuscode: 500
30m         Warning   Unhealthy                 pod/liveness-http   Liveness probe failed: HTTP probe failed with statuscode: 500
8m17s       Warning   BackOff                   pod/liveness-http   Back-off restarting failed container

The same command again after ~1 hour:

$ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT              MESSAGE
33s         Normal    Pulling     pod/liveness-http   Pulling image "k8s.gcr.io/liveness"
5m40s       Warning   Unhealthy   pod/liveness-http   Readiness probe failed: HTTP probe failed with statuscode: 500
15m         Warning   BackOff     pod/liveness-http   Back-off restarting failed container
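As a side note, that one-hour window comes from the API server's event TTL (kube-apiserver --event-ttl, default 1h0m0s). On a Kubeadm cluster, a minimal sketch for raising it is the ClusterConfiguration below; the 3h value is only an illustration, and this knob is not exposed on GKE:

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # retain Events for 3 hours instead of the default 1 hour
    event-ttl: "3h"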

Tests

The Readiness Probe check is executed every 10 seconds for more than one hour.

Mar 09 14:48:34 kubeadm kubelet[3855]: I0309 14:48:34.222085    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 14:48:44 kubeadm kubelet[3855]: I0309 14:48:44.221782    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 14:48:54 kubeadm kubelet[3855]: I0309 14:48:54.221828    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 15:01:34 kubeadm kubelet[3855]: I0309 15:01:34.222491    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:01:44 kubeadm kubelet[3855]: I0309 15:01:44.221877    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:01:54 kubeadm kubelet[3855]: I0309 15:01:54.221976    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 15:10:14 kubeadm kubelet[3855]: I0309 15:10:14.222163    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:10:24 kubeadm kubelet[3855]: I0309 15:10:24.221744    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:10:34 kubeadm kubelet[3855]: I0309 15:10:34.223877    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 16:04:14 kubeadm kubelet[3855]: I0309 16:04:14.222853    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:04:24 kubeadm kubelet[3855]: I0309 16:04:24.222531    3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500

Additionally, there are Liveness probe entries.

Mar 09 16:12:58 kubeadm kubelet[3855]: I0309 16:12:58.462878    3855 prober.go:117] Liveness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:13:58 kubeadm kubelet[3855]: I0309 16:13:58.462906    3855 prober.go:117] Liveness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465470    3855 kuberuntime_manager.go:656] Container "liveness" ({"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) of pod liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a): Container liveness failed liveness probe, will be restarted
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465587    3855 kuberuntime_manager.go:712] Killing unwanted container "liveness"(id={"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) for pod "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a)"

Total test time:

$ kubectl get po -w
NAME            READY   STATUS    RESTARTS   AGE
liveness-http   0/1     Running   21         99m
liveness-http   0/1     CrashLoopBackOff   21         101m
liveness-http   0/1     Running            22         106m
liveness-http   1/1     Running            22         106m
liveness-http   0/1     Running            22         106m
liveness-http   0/1     Running            23         109m
liveness-http   1/1     Running            23         109m
liveness-http   0/1     Running            23         109m
liveness-http   0/1     CrashLoopBackOff   23         112m
liveness-http   0/1     Running            24         117m
liveness-http   1/1     Running            24         117m
liveness-http   0/1     Running            24         117m

Conclusion

Liveness probe check was not called anymore

A liveness check is created when Kubernetes creates the pod, and it is recreated each time the Pod is restarted. In your configuration you have set initialDelaySeconds: 20, so after creating the pod Kubernetes waits 20 seconds and then calls the liveness probe every periodSeconds: 60. After 3 consecutive failures (as you have set failureThreshold: 3), i.e. only a few minutes later, Kubernetes restarts the pod according to its RestartPolicy. You can also find this in the kubelet logs:

Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465470    3855 kuberuntime_manager.go:656] Container "liveness" ({"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) of pod liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a): Container liveness failed liveness probe, will be restarted

Why will it be restarted? The answer can be found in Container probes.

livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy.

The default restart policy in GKE is Always, so your pod will be restarted over and over again.
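For completeness, restartPolicy is a Pod-level field; the manifest in the question doesn't set it, so the default applies. A minimal sketch of setting it explicitly (Always is the default; OnFailure and Never are the alternatives for a bare Pod):

apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  # applies to every container in the Pod; if omitted, the default is Always
  restartPolicy: Always
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server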

Readiness probe checks were still called, but the interval became longer (the maximum interval looks to be about 10 minutes)

I think you came to that conclusion because you based it on $ kubectl get events and $ kubectl describe po. In both cases, events are removed after 1 hour by default. In my Tests section you can see that the Readiness probe entries run from 14:48:34 till 16:04:24, so Kubernetes was calling the Readiness Probe every 10 seconds the whole time.

Why didn't the Liveness probe run, and why did the interval of the Readiness probe change?

As I showed in the Tests section, the Readiness probe behavior didn't change; what was misleading in this case was relying on $ kubectl get events. The Liveness Probe is still being called, but only 3 times after each time the pod is created/restarted. I've also included the output of $ kubectl get po -w: whenever the pod is recreated, you can find those liveness probes in the kubelet logs.

In my plan, I would create an alert policy whose condition is something like:

  • liveness probe errors happen 3 times in 3 minutes

If the liveness probe fails 3 times, with your current setup the pod is restarted. In that situation you could use each restart to create an alert:

Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
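As a rough sketch of that idea (not a definitive policy), something like the YAML below alerts whenever restart_count increases for the liveness container. It assumes the policy is created with gcloud alpha monitoring policies create --policy-from-file=restart-alert.yaml; the display names, the 300s alignment window and the commented-out notification channel are placeholders to adjust:

# restart-alert.yaml - hypothetical Cloud Monitoring alert policy
displayName: "liveness-http container restarts"
combiner: OR
conditions:
- displayName: "restart_count increased"
  conditionThreshold:
    # delta of the cumulative restart counter over each 5-minute window
    filter: >-
      metric.type = "kubernetes.io/container/restart_count" AND
      resource.type = "k8s_container" AND
      resource.labels.container_name = "liveness"
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_DELTA
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
# notificationChannels:
# - projects/PROJECT_ID/notificationChannels/CHANNEL_ID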

You can find some more useful information regarding Monitoring alerts in related Stack Overflow questions.
