'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)


Question

My question (to MS and anyone else) is: Why is this issue occurring, and what workaround can be implemented by the users / customers themselves as opposed to by Microsoft Support?

There have obviously been 'a few' other questions about this issue:

  1. Managed Azure Kubernetes connection error
  2. Can't contact our Azure-AKS kube - TLS handshake timeout
  3. Azure Kubernetes: TLS handshake timeout (this one has some Microsoft feedback)

And multiple GitHub issues posted to the AKS repo:

  1. https://github.com/Azure/AKS/issues/112
  2. https://github.com/Azure/AKS/issues/124
  3. https://github.com/Azure/AKS/issues/164
  4. https://github.com/Azure/AKS/issues/177
  5. https://github.com/Azure/AKS/issues/324

Plus a few Twitter threads:

  1. https://twitter.com/ternel/status/955871839305261057

TL;DR

The current best solution is to post a help ticket — and wait — or re-create your AKS cluster (maybe more than once, cross your fingers, see below...), but there should be something better. At the least, please grant AKS preview customers, regardless of support tier, the ability to upgrade their support request severity for THIS specific issue.

You can also try scaling your Cluster (assuming that doesn't break your app).
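
For context, the failure mode discussed throughout this post is that kubectl cannot complete a TLS handshake with the managed API server, so any API call fails. A minimal way to confirm you are hitting it (the resource group and cluster names below are placeholders, not values from my setup):

# Any call to the API server fails while the issue is active:
kubectl get nodes
# Unable to connect to the server: net/http: TLS handshake timeout

# The cluster resource can still be inspected from the Azure management side:
az aks show --resource-group <resource-group> --name <cluster-name> --output table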

Many of the above GitHub issues have been closed as resolved but the issue persists. Previously there was an announcements document regarding the problem but no such status updates are currently available even though the problem continues to present itself:

  1. https://github.com/Azure/AKS/tree/master/annoucements

I am posting this as I have a few new tidbits that I haven't seen elsewhere and I am wondering if anyone has ideas as far as other potential options for working around the issue.

The first piece I haven't seen mentioned elsewhere is resource usage on the nodes / VMs / instances that are being impacted by the above Kubectl 'Unable to connect to the server: net/http: TLS handshake timeout' issue.

The node(s) on my impacted cluster look like this:

The drop in utilization and network io correlates strongly with both the increase in disk utilization AND the time period we began experiencing the issue.

The overall Node / VM utilization is generally flat prior to this chart for the previous 30 days with a few bumps relating to production site traffic / update pushes etc.

To the above point, here are the metrics for the same Node after scaling up and then back down (which happened to alleviate our issue, but does not always work — see answers at bottom):

Notice the 'Dip' in CPU and Network? That's where the Net/http: TLS issue impacted us — and when the AKS Server was unreachable from Kubectl. It seems like it wasn't talking to the VM / Node, in addition to not responding to our requests.

As soon as we were back (scaled the # of nodes up by one, and back down — see answers for the workaround) the metrics (CPU etc.) went back to normal — and we could connect from Kubectl. This means we can probably create an alarm off of this behavior (and I have an issue open asking about this on the Azure DevOps side: https://github.com/Azure/AKS/issues/416).
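
As a rough sketch of that alarm idea (this is my own suggestion, not something from the linked issue; the metric name, threshold, and windows are assumptions you would tune against your own baseline), an Azure Monitor metric alert on an abnormally low CPU reading for the node VM could flag the dip:

# Hypothetical alert: fire when average CPU on a node VM drops below a floor
# that this workload normally never reaches. Threshold and windows are guesses.
az monitor metrics alert create \
  --name aks-node-cpu-dip \
  --resource-group <node-resource-group> \
  --scopes <node-vm-resource-id> \
  --condition "avg Percentage CPU < 5" \
  --window-size 15m \
  --evaluation-frequency 5m \
  --description "Possible AKS 'TLS handshake timeout' condition (CPU dip on node)"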

Zimmergren over on GitHub indicates that he has fewer issues with larger instances than he did running bare-bones smaller nodes. This makes sense to me and could indicate that the way the AKS servers divvy up the workload (see next section) could be based on the size of the instances.

"The size of the nodes (e.g. D2, A4, etc) :) I've experienced that when running A4 and up, my cluster is healther than if running A2, for example. (And I've got more than a dozen similar experiences with size combinations and cluster failures, unfortunately)." (https://github.com/Azure/AKS/issues/268#issuecomment-375715435)

Other Cluster size impact references:

  1. giorgited (https://github.com/Azure/AKS/issues/268#issuecomment-376390692)

Perhaps the AKS servers responsible for smaller Clusters get hit by this more often?

Existence of Multiple AKS Management 'Servers' in one Az Region

The next thing I haven't seen mentioned elsewhere is the fact that you can have multiple Clusters running side by side in the same Region where one Cluster (production for us in this case) gets hit with 'net/http: TLS handshake timeout' and the other is working fine and can be connected to normally via Kubectl (for us this is our identical staging environment).
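
A quick way to reproduce that side-by-side comparison from one workstation is to keep credentials for both clusters in your kubeconfig and switch contexts (the names below are placeholders; by default az aks get-credentials names the context after the cluster):

az aks get-credentials --resource-group <resource-group> --name <production-cluster>
az aks get-credentials --resource-group <resource-group> --name <staging-cluster>

kubectl config use-context <production-cluster>
kubectl get nodes   # fails: Unable to connect to the server: net/http: TLS handshake timeout
kubectl config use-context <staging-cluster>
kubectl get nodes   # works normally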

The fact that users (Zimmergren etc. above) seem to feel that the Node size impacts the likelihood that this issue will impact you also seems to indicate that node size may relate to the way the sub-region responsibilities are assigned to the sub-regional AKS management servers.

That could mean that re-creating your cluster with a different Cluster size would be more likely to place you on a different management server — alleviating the issue and reducing the likelihood that multiple re-creations would be necessary.
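
If you do go the re-creation route, a hedged sketch of what that looks like with a different node VM size (all names, counts, and the Standard_D4s_v3 size are only examples, and deleting the cluster destroys everything running on it):

# WARNING: removes the existing cluster and all workloads on it.
az aks delete --resource-group <resource-group> --name <cluster-name> --yes

# Re-create with a larger node size, per the anecdotes above:
az aks create \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --generate-ssh-keys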

Staging Cluster Utilization

Both of our AKS Clusters are in U.S. East. As a reference to the above 'Production' Cluster metrics our 'Staging' Cluster (also U.S. East) resource utilization does not have the massive drop in CPU / Network IO — AND does not have the increase in disk etc. over the same period:

Both of our Clusters are running identical ingresses, services, pods, and containers, so it is also unlikely that anything a user is doing causes this problem to crop up.

The above existence of multiple AKS management server sub-regional responsibilities makes sense with the behavior described by other users on GitHub (https://github.com/Azure/AKS/issues/112), where some users are able to re-create a cluster (which can then be contacted) while others re-create and still have issues.

In an emergency (i.e. your production site... like ours... needs to be managed) you can PROBABLY just re-create until you get a working cluster that happens to land on a different AKS management server instance (one that is not impacted), but be aware that this may not happen on your first attempt — AKS cluster re-creation is not exactly instant.

That said...

All of the containers / ingresses / resources on our impacted VM appear to be working well, and I don't have any alarms going off for up-time / resource monitoring (other than the utilization weirdness listed above in the graphs).

I want to know why this issue is occurring and what workaround can be implemented by the users themselves, as opposed to by Microsoft Support (I currently have a ticket in). If you have an idea, let me know.

Hints at a Potential Cause

  1. https://github.com/Azure/AKS/issues/164#issuecomment-363613110
  2. https://github.com/Azure/AKS/issues/164#issuecomment-365389154

Why no GKE?

I understand that Azure AKS is in preview and that a lot of people have moved to GKE because of this problem. That said, my Azure experience has been nothing but positive thus far and I would prefer to contribute a solution if at all possible.

And also... GKE occasionally faces something similar:

  1. TLS handshake timeout with kubernetes in GKE

I would be interested to see if scaling the nodes on GKE also solved the problem over there.
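
For anyone who wants to run that experiment, the GKE equivalent of the scale-up / scale-down workaround would be roughly the following (cluster name, zone, and node counts are placeholders; whether this actually clears the GKE-side timeouts is exactly the open question):

gcloud container clusters resize <cluster-name> --zone <zone> --num-nodes 4
kubectl get nodes
gcloud container clusters resize <cluster-name> --zone <zone> --num-nodes 3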

Answer

Workaround 1 (May Not Work for Everyone)

An interesting solution (worked for me) to test is scaling the number of nodes in your cluster up, and then back down...

  1. Log in to the Azure console (Kubernetes Service blade).
  2. Scale your cluster up by 1 node.
  3. Wait for the scale operation to complete and then try to connect (you should be able to).
  4. Scale your cluster back down to its normal size to avoid cost increases.

Alternately you can (maybe) do this from the command line:

az aks scale --name <name-of-cluster> --node-count <new-number-of-nodes> --resource-group <name-of-cluster-resource-group>

Since this is a finicky issue and I used the web interface, I am not certain whether the above is identical or whether it would work.

Total time it took me ~2 minutes — for my situation that is MUCH better than re-creating / configuring a Cluster (potentially multiple times...)
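
Putting the steps together, the CLI version of what I clicked through in the web UI would presumably look like the following (node counts and names are placeholders, and since I only used the web interface this sequence is unverified):

# Scale up by one node, confirm kubectl can reach the API server again, then scale back down:
az aks scale --resource-group <resource-group> --name <cluster-name> --node-count 4
kubectl get nodes
az aks scale --resource-group <resource-group> --name <cluster-name> --node-count 3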

Zimmergren brings up some good points that Scaling is not a true Solution:

"It worked sometimes, where the cluster self-healed a period after scaling. It failed sometimes with the same errors. I don't consider scaling a solution to this problem, as that causes other challenges depending on how things are set up. I wouldn't trust that routine for a GA workload, that's for sure. In the current preview, it's a bit wild west (and expected), and I'm happy to blow up the cluster and create a new one when this fails continuously." (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)

Since I had a support ticket open at the time I ran into the above scaling solution, I was able to get feedback (or rather a guess) on why the above might have worked; here's a paraphrased response:

"I know that scaling the cluster can sometimes help if you get into a state where the number of nodes is mismatched between "az aks show" and "kubectl get nodes". This may be similar."

Workaround References:

  1. GitHub user scaled their nodes from the console and resolved the problem: https://github.com/Azure/AKS/issues/268#issuecomment-375722317

Workaround Didn't Work?

If this DOES NOT work for you, please post a comment below, as I am going to try to keep an up-to-date list of how often the issue crops up, whether it resolves itself, and whether this solution works across Azure AKS users (it looks like it doesn't work for everyone).

Scaling Up / Down DID NOT work for:

  1. omgsarge (https://github.com/Azure/AKS/issues/112#issuecomment-395231681)
  2. Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
  3. sercand — scale operation itself failed — not sure if it would have impacted connectability (https://github.com/Azure/AKS/issues/268#issuecomment-395301296)

Scaling Up / Down DID work for:

  1. Me
  2. LohithChanda (https://github.com/Azure/AKS/issues/268#issuecomment-395207716)
  3. Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)

Email Azure AKS Specific Support

If after all the diagnosis you still suffer from this issue, please don't hesitate to send an email to aks-help@service.microsoft.com.
