What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster?

Problem Description

I have a working Cluster with Services that all respond behind a helm-installed nGinx Ingress running on Azure AKS. This ended up being Azure specific.

My question is: why do my connections to Services / pods in that Cluster get disconnected periodically (apparently due to some kind of idle timeout), and why does that disconnect seem to coincide with my Az AKS Browse UI connection being dropped as well?

For some more background on what I'm asking about, here is what eventually triggers the time-out that causes the local 'browse' proxy UI to be disconnected from the Cluster.

When working with Azure AKS from the Az CLI you can launch the local browse UI from your terminal with:

az aks browse --resource-group <resource-group> --name <cluster-name>

This works fine and pops open a browser window with the cluster's Kubernetes dashboard.

In your terminal you will see something like the following:

  Proxy running on http://127.0.0.1:8001/
  Press CTRL+C to close the tunnel...
  Forwarding from 127.0.0.1:8001 -> 9090
  Forwarding from [::1]:8001 -> 9090
  Handling connection for 8001
  Handling connection for 8001
  Handling connection for 8001

If you leave the connection to your Cluster idle for a few minutes (i.e. you don't interact with the UI), you should see the following printed to indicate that the connection has timed out:

E0605 13:39:51.940659 5704 portforward.go:178] lost connection to pod

One thing I still don't understand is whether OTHER activity inside the Cluster can prolong this timeout, but regardless, once you see the above you are essentially at the same place I am... which means we can talk about the fact that it looks like all of my other connections OUT from pods on that server have also been closed by whatever timeout process is responsible for cutting ties with the AKS browse UI.

So what's the issue?

The reason this is a problem for me is that I have a Service running a Ghost Blog pod which connects to a remote MySQL database using an npm package called 'Knex'. As it happens, the newer versions of Knex have a bug (which has yet to be addressed) whereby if the connection between the Knex client and a remote db server is cut and needs to be restored, it doesn't re-connect and just loads infinitely.

nGinx Error 503 Gateway Time-out

In my situation that resulted in the nGinx Ingress giving me an Error 503 Gateway Time-out. This was because Ghost wasn't responding after the idle timeout cut the Knex connection, since Knex wasn't working properly and doesn't restore the broken connection to the server.

Fine. I rolled back Knex and everything works great.

But why the heck are my pod connections to my Database being severed to begin with?

Hence this question, to hopefully save some future person days of attempting to troubleshoot phantom issues that relate back to Kubernetes (maybe Azure specific, maybe not) cutting connections after a Service / pod has been idle for some time.

Solution

Short Answer:

Azure AKS automatically deploys an Azure Load Balancer (with a public IP address) when you add a new Ingress (nGinx / Traefik... ANY Ingress). That Load Balancer is configured as a 'Basic' Azure LB, which has a 4 minute idle connection timeout.

That idle timeout is both standard AND required (although you MAY be able to modify it, see here: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout). That being said, there is no way to ELIMINATE it entirely for traffic heading externally OUT from the Load Balancer IP; the longest duration currently supported is 30 minutes.

There is no native Azure way to get around an idle connection being cut.

So as per the original question, the best way (I feel) to handle this is to leave the timeout at 4 minutes (since it has to exist anyway) and then set up your infrastructure to disconnect your connections gracefully (when idle) before hitting the Load Balancer timeout.
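To make that concrete for a Knex / MySQL setup like the one above, here is a minimal sketch of a pool configuration that recycles idle connections well before the 4 minute mark. The connection details are placeholders, and the pool options shown (min, max, idleTimeoutMillis) are standard Knex / tarn pool settings rather than anything AKS specific.

  // Sketch only: recycle idle Knex connections before Azure's 4 minute
  // (240,000 ms) Basic LB idle timeout can silently cut them.
  // Connection details below are placeholders.
  import knex from "knex";

  const db = knex({
    client: "mysql",
    connection: {
      host: "my-remote-mysql.example.com", // hypothetical remote DB host
      user: process.env.DB_USER,
      password: process.env.DB_PASSWORD,
      database: "ghost",
    },
    pool: {
      min: 0,                    // allow the pool to drain completely when idle
      max: 10,
      idleTimeoutMillis: 180000, // drop idle connections at 3 minutes,
                                 // comfortably under the LB's 4 minute cutoff
    },
  });

  export default db;

The exact value matters less than keeping it, with some margin, under whatever idle timeout the Load Balancer enforces.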

Our Solutions

For our Ghost Blog (which hit a MySQL database) I was able to roll back as mentioned above, which made the Ghost process able to handle a DB disconnect / reconnect scenario.

What about Rails?

Yep. Same problem.

For a separate Rails-based app that we also run on AKS, which connects to a remote Postgres DB (not on Azure), we ended up implementing PGbouncer (https://github.com/pgbouncer/pgbouncer) as an additional container on our Cluster via the awesome directions found here: https://github.com/edoburu/docker-pgbouncer/tree/master/examples/kubernetes/singleuser

Generally, anyone attempting to access a remote database FROM AKS is probably going to need to implement an intermediary connection pooling solution. The pooling service sits in the middle (PGbouncer for us) and keeps track of how long a connection has been idle so that your worker processes don't need to care about that.

If you start to approach the Load Balancer timeout, the connection pooling service will throw out the old connection and make a fresh new one (resetting the timer). That way when your client sends data down the pipe it lands on your Database server as anticipated.
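For a rough idea of what the application side of that pattern looks like in a Node service (a Rails app does the equivalent through its database configuration), here is a minimal sketch using the 'pg' package. The 'pgbouncer' host name and the 6432 port are assumptions based on a typical in-cluster PGbouncer deployment like the one linked above.

  // Sketch only: the app dials an in-cluster PGbouncer Service instead of the
  // remote Postgres host directly. The hop that actually crosses the Azure
  // Load Balancer (PGbouncer out to the remote database) is owned by
  // PGbouncer, which tracks idle time and recycles those connections before
  // the LB's timeout can cut them. The Service name and port are assumptions.
  import { Pool } from "pg";

  const pool = new Pool({
    host: "pgbouncer",          // hypothetical in-cluster Service name
    port: 6432,                 // PGbouncer's default listen port
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
  });

  export const query = (text: string, params?: unknown[]) =>
    pool.query(text, params);

The recycling itself is governed by PGbouncer's own configuration; the app only needs to know where the pooler lives.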

In closing

This was an INSANELY frustrating bug / case to track down. We burned at least 2 dev-ops days figuring the first solution out, but even KNOWING that it was probably the same issue we burned another 2 days this time around.

Even elongating the timer beyond the 4 minute default wouldn't really help, since that would just make the problem rarer and harder to troubleshoot. I guess I just hope that anyone who has trouble connecting from Azure AKS / Kubernetes to a remote db is better at googling than I am and can save themselves some pain.

Thanks to MSFT Support (Kris you are the best) for the hint on the LB timer, and to the dude who put together PGbouncer in a container so I didn't have to reinvent the wheel.
