Ingress on GKE remains in status "Backend unhealthy"


Question

Given:

  • a simple pod running nginx
  • a NodePort service
  • an ingress

When calling the pod from within the cluster, we get a 200 response code.

When calling the service from within the cluster, we also get a 200 response code.

The ingress shows the following annotation:

ingress.kubernetes.io/backends: '{"k8s-be-30606--559b9972f521fd4f":"UNHEALTHY"}'
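
The annotation value is a JSON object that maps backend names to a health state. As a minimal sketch, the Python below pulls out the node ports of the backends that are not healthy. It assumes the backend name follows the pattern `k8s-be-<nodePort>--<suffix>`, which is inferred from the value above rather than from any documented contract, and the function name `unhealthy_node_ports` is hypothetical.

```python
import json
import re

# Assumed naming pattern, inferred from the annotation shown above:
# k8s-be-<nodePort>--<cluster suffix>
BACKEND_NAME = re.compile(r"^k8s-be-(\d+)--")


def unhealthy_node_ports(annotation_value):
    """Return the node ports of all backends not reported as HEALTHY."""
    backends = json.loads(annotation_value)
    ports = []
    for name, state in backends.items():
        match = BACKEND_NAME.match(name)
        # Skip healthy backends and names that do not fit the assumed pattern.
        if state != "HEALTHY" and match:
            ports.append(int(match.group(1)))
    return sorted(ports)


annotation = '{"k8s-be-30606--559b9972f521fd4f":"UNHEALTHY"}'
print(unhealthy_node_ports(annotation))
```

Knowing the node port (30606 here) narrows down which port the load balancer health check has to reach on each node.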

To top things off, we have a different Kubernetes cluster with the exact same configuration (apart from the namespace, dev vs. qa, plus timestamps and assigned IPs and ports) where everything is working properly.

We've already tried removing the ingress, deleting pods, scaling up the pods, and explicitly defining the readiness probe, all without any change in the result.

Judging from the above, it's the health check on the pod that's failing for some reason, even though doing it manually (curl to a node's internal IP plus the service's node port, from within the cluster) returns 200, and in qa it works fine with the same container image.

Is there any log available in Stackdriver Logging (or elsewhere) where we can see exactly which request the health check performs and what the exact response code is (or whether it timed out for some reason)?

Is there any way to get more insight into what's happening in the Google processes?

We use the default GKE ingress controller.

Some additional info: when comparing with an entirely different application, I see tons of requests like these:

10.129.128.10 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"
10.129.128.8 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"
10.129.128.12 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"
10.129.128.10 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"

I assume these are the health checks. I don't see any similar logs for the failing application, nor for the working version in qa. So I imagine the health checks are ending up somewhere entirely different, and by chance in qa it's something that also returns 200. So the question remains: where can I see the actual requests performed by a health check?

Also, for this particular application I see about 8 health checks per second for that single pod, which seems a bit much to me (the configured interval is 60 seconds). Is it possible that health checks for other applications are ending up in this one?
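
One way to put a number on the health check rate is to count `GoogleHC` requests per timestamp in the access log. The Python below is a rough sketch that assumes the combined log format shown in the sample lines above. The function name `health_checks_per_second` is hypothetical.

```python
import re
from collections import Counter

# Matches the access log format of the sample lines:
# ip - - [timestamp] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"$'
)


def health_checks_per_second(lines):
    """Count requests with a GoogleHC user agent, keyed by log timestamp."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(4).startswith("GoogleHC/"):
            counts[match.group(2)] += 1
    return counts


sample = [
    '10.129.128.10 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"',
    '10.129.128.8 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"',
    '10.129.128.12 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"',
    '10.129.128.10 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"',
]
for timestamp, count in health_checks_per_second(sample).items():
    print(timestamp, count)
```

Because the timestamps have one-second resolution, each counter value is the number of health check requests logged in that second.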

Recommended answer

GKE manages a firewall rule. For some reason, new (node) ports used by ingresses were no longer being added to this rule automatically. After adding the new ports to this rule manually in the console, the backend service became healthy.
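
To spot this situation, the node ports used by the ingress backends can be compared against the ports the firewall rule allows. The Python below is only a sketch: it assumes the allowed ports have already been read from the rule (for example from the output of `gcloud compute firewall-rules describe`) as a list of strings, each either a single port or a `low-high` range. The function names and the sample port values other than 30606 are hypothetical.

```python
def port_allowed(port, allowed_ports):
    """Check a port against specs such as "30606" or "30000-30100"."""
    for spec in allowed_ports:
        low, _, high = spec.partition("-")
        # A single port has no upper bound, so reuse the lower bound.
        if int(low) <= port <= int(high or low):
            return True
    return False


def missing_node_ports(node_ports, allowed_ports):
    """Return the node ports that no allowed-port spec covers."""
    return [p for p in node_ports if not port_allowed(p, allowed_ports)]


# Hypothetical firewall rule contents; 30606 is the node port from the question.
allowed = ["31000", "32000-32100"]
print(missing_node_ports([30606, 31000, 32050], allowed))
```

Any port this returns is one the health checks cannot reach through that rule, which would leave the corresponding backend unhealthy.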

Still to find out:

  • Why aren't the ports added automatically anymore?
  • Why don't I see the health checks in the access logs?

In any case, I hope this helps someone else, since we wasted a huge amount of time finding this out.

Edit:

The cause turned out to be an invalid certificate used for TLS termination by an unrelated ingress (unrelated except that it's managed by the same controller). Once that was fixed, the firewall rule was updated automatically again.
