'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)


Problem Description

    My question (to MS and anyone else) is: Why is this issue occurring and what work around can be implemented by the users / customers themselves as opposed to by Microsoft Support?

    There have obviously been 'a few' other questions about this issue:

    1. Managed Azure Kubernetes connection error
    2. Can't contact our Azure-AKS kube - TLS handshake timeout
    3. Azure Kubernetes: TLS handshake timeout (this one has some Microsoft feedback)

    And multiple GitHub issues posted to the AKS repo:

    1. https://github.com/Azure/AKS/issues/112
    2. https://github.com/Azure/AKS/issues/124
    3. https://github.com/Azure/AKS/issues/164
    4. https://github.com/Azure/AKS/issues/177
    5. https://github.com/Azure/AKS/issues/324

    Plus a few twitter threads:

    1. https://twitter.com/ternel/status/955871839305261057

    TL;DR

    Skip to workarounds in Answers below.

    Current best solution is to post a help ticket — and wait — or re-create your AKS cluster (maybe more than once, cross your fingers, see below...), but there should be something better. At least, please grant AKS preview customers, regardless of support tier, the ability to upgrade their support request severity for THIS specific issue.

    You can also try scaling your Cluster (assuming that doesn't break your app).

    What about GitHub?

    Many of the above GitHub issues have been closed as resolved but the issue persists. Previously there was an announcements document regarding the problem but no such status updates are currently available even though the problem continues to present itself:

    1. https://github.com/Azure/AKS/tree/master/annoucements

    I am posting this as I have a few new tidbits that I haven't seen elsewhere and I am wondering if anyone has ideas as far as other potential options for working around the issue.

    Affected VM / Node Resource Usage

    The first piece I haven't seen mentioned elsewhere is Resource usage on the nodes / vms / instances that are being impacted by the above Kubectl 'Unable to connect to the server: net/http: TLS handshake timeout' issue.
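
    For reference, this is what the failure looks like from the command line (any kubectl call against the affected cluster returns the same error):

    $ kubectl get nodes
    Unable to connect to the server: net/http: TLS handshake timeout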

    Production Node Utilization

    The node(s) on my impacted cluster look like this:

    The drop in utilization and network io correlates strongly with both the increase in disk utilization AND the time period we began experiencing the issue.

    The overall Node / VM utilization is generally flat prior to this chart for the previous 30 days with a few bumps relating to production site traffic / update pushes etc.

    Metrics After Issue Mitigation (Added Postmortem)

    To the above point, here are the metrics for the same Node after scaling up and then back down (which happened to alleviate our issue, but does not always work — see answers at bottom):

    Notice the 'Dip' in CPU and Network? That's where the Net/http: TLS issue impacted us — and when the AKS Server was un-reachable from Kubectl. Seems like it wasn't talking to the VM / Node in addition to not responding to our requests.

    As soon as we were back (scaled the # nodes up by one, and back down — see answers for workaround) the Metrics (CPU etc) went back to normal — and we could connect from Kubectl. This means we can probably create an Alarm off of this behavior (and I have an issue open asking about this on the Azure DevOps side: https://github.com/Azure/AKS/issues/416)
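
    If you want to experiment with alerting on that dip, a very rough sketch using an Azure Monitor metric alert on one of the node VMs might look like the following (the resource names, threshold and window here are my own assumptions for illustration, not a documented detection rule for this failure):

    az monitor metrics alert create \
      --name aks-node-cpu-dip \
      --resource-group MC_my-aks-rg_my-aks-cluster_eastus \
      --scopes <node-vm-resource-id> \
      --condition "avg Percentage CPU < 5" \
      --window-size 15m \
      --evaluation-frequency 5m \
      --description "Node CPU dipped abnormally - possibly the AKS TLS handshake timeout issue"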

    Node Size Potentially Impacts Issue Frequency

    Zimmergren over on GitHub indicates that he has fewer issues with larger instances than he did running bare-bones smaller nodes. This makes sense to me and could indicate that the way the AKS servers divvy up the workload (see next section) could be based on the size of the instances.

    "The size of the nodes (e.g. D2, A4, etc) :) I've experienced that when running A4 and up, my cluster is healther than if running A2, for example. (And I've got more than a dozen similar experiences with size combinations and cluster failures, unfortunately)." (https://github.com/Azure/AKS/issues/268#issuecomment-375715435)

    Other Cluster size impact references:

    1. giorgited (https://github.com/Azure/AKS/issues/268#issuecomment-376390692)

    An AKS server responsible for a larger number of smaller Clusters may possibly get hit more often?

    Existence of Multiple AKS Management 'Servers' in one Az Region

    The next thing I haven't seen mentioned elsewhere is the fact that you can have multiple Clusters running side by side in the same Region where one Cluster (production for us in this case) gets hit with 'net/http: TLS handshake timeout' and the other is working fine and can be connected to normally via Kubectl (for us this is our identical staging environment).

    The fact that users (Zimmergren etc above) seem to feel that the Node size impacts the likelihood that this issue will impact you also seems to indicate that node size may relate to the way the sub-region responsibilities are assigned to the sub-regional AKS management servers.

    That could mean that re-creating your cluster with a different Cluster size would be more likely to place you on a different management server — alleviating the issue and reducing the likelihood that multiple re-creations would be necessary.

    Staging Cluster Utilization

    Both of our AKS Clusters are in U.S. East. As a reference against the above 'Production' Cluster metrics, our 'Staging' Cluster (also U.S. East) resource utilization does not have the massive drop in CPU / Network IO — AND does not have the increase in disk etc. over the same period:

    Identical Environments are Impacted Differently

    Both of our Clusters are running identical ingresses, services, pods, containers so it is also unlikely that anything a user is doing causes this problem to crop up.

    Re-creation is only SOMETIMES successful

    The above existence of multiple AKS management server sub-regional responsibilities makes sense with the behavior described by other users on github (https://github.com/Azure/AKS/issues/112) where some users are able to re-create a cluster (which can then be contacted) while others re-create and still have issues.

    Emergency could = Multiple Re-Creations

    In an emergency (i.e. your production site... like ours... needs to be managed) you can PROBABLY just re-create until you get a working cluster that happens to land on a different AKS management server instance (one that is not impacted), but be aware that this may not happen on your first attempt — AKS cluster re-creation is not exactly instant.
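
    If you do go the re-creation route, the CLI loop is roughly the following (a sketch with hypothetical names; pick the node count / VM size that matches your own setup):

    az aks delete --name my-aks-cluster --resource-group my-aks-rg --yes
    az aks create --name my-aks-cluster --resource-group my-aks-rg \
      --node-count 3 --node-vm-size Standard_D2_v2 --generate-ssh-keys
    az aks get-credentials --name my-aks-cluster --resource-group my-aks-rg
    kubectl get nodes    # if this still times out, you may have to repeat the whole cycle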

    That said...

    Resources on the Impacted Nodes Continue to Function

    All of the containers / ingresses / resources on our impacted VM appear to be working well and I don't have any alarms going off for up-time / resource monitoring (other than the utilization weirdness listed above in the graphs).

    I want to know why this issue is occurring and what work around can be implemented by the users themselves as opposed to by Microsoft Support (currently have a ticket in). If you have an idea let me know.

    Potential Hints at the Cause

    1. https://github.com/Azure/AKS/issues/164#issuecomment-363613110
    2. https://github.com/Azure/AKS/issues/164#issuecomment-365389154

    Why no GKE?

    I understand that Azure AKS is in preview and that a lot of people have moved to GKE because of this problem. That said, my Azure experience has been nothing but positive thus far and I would prefer to contribute a solution if at all possible.

    And also... GKE occasionally faces something similar:

    1. TLS handshake timeout with kubernetes in GKE

    I would be interested to see if scaling the nodes on GKE also solved the problem over there.

    Solution

    Workaround 1 (May Not Work for Everyone)

    An interesting solution to test (it worked for me) is scaling the number of nodes in your cluster up, and then back down...

    1. Log into the Azure Console — Kubernetes Service blade.
    2. Scale your cluster up by 1 node.
    3. Wait for scale to complete and attempt to connect (you should be able to).
    4. Scale your cluster back down to the normal size to avoid cost increases.

    Alternately you can (maybe) do this from the command line:

    az aks scale --name <name-of-cluster> --node-count <new-number-of-nodes> --resource-group <name-of-cluster-resource-group>

    Since it is a finicky issue and I used the web interface, I am uncertain whether the above is identical or would work.
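
    For illustration, with hypothetical cluster / resource-group names, scaling up by one node and then back down would look something like:

    az aks scale --name my-aks-cluster --resource-group my-aks-rg --node-count 4
    # verify kubectl can connect again, then scale back down
    az aks scale --name my-aks-cluster --resource-group my-aks-rg --node-count 3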

    Total time it took me ~2 minutes — for my situation that is MUCH better than re-creating / configuring a Cluster (potentially multiple times...)

    That being Said....

    Zimmergren brings up some good points that Scaling is not a true Solution:

    "It worked sometimes, where the cluster self-healed a period after scaling. It failed sometimes with the same errors. I don't consider scaling a solution to this problem, as that causes other challenges depending on how things are set up. I wouldn't trust that routine for a GA workload, that's for sure. In the current preview, it's a bit wild west (and expected), and I'm happy to blow up the cluster and create a new one when this fails continuously." (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)

    Azure Support Feedback

    Since I had a support ticket open at the time I ran into the above scaling solution, I was able to get feedback (or rather a guess) on why the above might have worked; here's a paraphrased response:

    "I know that scaling the cluster can sometimes help if you get into a state where the number of nodes is mismatched between "az aks show" and "kubectl get nodes". This may be similar."

    Workaround References:

    1. A GitHub user scaled nodes from the console and fixed the problem: https://github.com/Azure/AKS/issues/268#issuecomment-375722317
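
    To check whether you are in the "az aks show" vs "kubectl get nodes" mismatch state that support mentioned above, you can compare the node count AKS reports against what the API server actually sees (a sketch; the names and the JMESPath query are mine, not from the ticket):

    az aks show --name my-aks-cluster --resource-group my-aks-rg --query "agentPoolProfiles[].count" -o tsv
    kubectl get nodes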

    Workaround Didn't Work?

    If this DOES NOT work for you, please post a comment below as I am going to try to keep an up to date list of how often the issue crops up, whether it resolves itself, and whether this solution works across Azure AKS users (looks like it doesn't work for everyone).

    Users Scaling Up / Down DID NOT work for:

    1. omgsarge (https://github.com/Azure/AKS/issues/112#issuecomment-395231681)
    2. Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
    3. sercand — scale operation itself failed — not sure if it would have impacted connectability (https://github.com/Azure/AKS/issues/268#issuecomment-395301296)

    Scaling Up / Down DID work for:

    1. Me
    2. LohithChanda (https://github.com/Azure/AKS/issues/268#issuecomment-395207716)
    3. Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)

    Email Azure AKS Specific Support

    If after all the diagnosis you still suffer from this issue, please don't hesitate to send email to aks-help@service.microsoft.com
