Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster


Problem description

I am running an Amazon Web Services RDS Aurora 5.6 database cluster. There are a couple of Lambdas talking to these database instances, all written in Python. Everything was running well, but suddenly, starting a couple of days ago, the Python code sometimes started throwing the following error:

[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)

This happens in 1 of every 1000 or so new connections. What is interesting is that I haven't touched this whole service in the last couple of days (since it started happening). All Lambdas use the official MySQL connector client and connect on every initialization with the following snippet:

import mysql.connector as mysql
import os

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                         password=os.environ['DATABASE_PASSWORD'],
                         database=os.environ['DATABASE_NAME'],
                         host=os.environ['DATABASE_HOST'],
                         autocommit=True)

To rule out that this is a problem in the Python MySQL client I added the following to resolve the host:

import os
import socket

host = socket.gethostbyname(os.environ['DATABASE_HOST'])

Also here I sometimes get the following error:

[ERROR] gaierror: [Errno -2] Name or service not known

Now I suspect this has something to do with DNS, but since I'm just using the cluster endpoint there is not much I can do about that. What is interesting is that I also recently encountered exactly the same problem in a different region, with the same setup (Aurora 5.6 cluster, Python Lambdas connecting to it), and the same thing happens there.

I've tried restarting all the machines in the cluster, but the problem still seems to occur. Is this really a DNS issue? What can I do to stop this from happening?

Solution

AWS Support have told me that this error is likely to be caused by a traffic quota in AWS's VPCs.

According to their documentation on DNS Quotas:

Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.

It's important to note that the metric we're looking at here is packets per second, per ENI. Why does that matter? It may not be immediately obvious, but although the actual number of packets per query varies, a single DNS query typically involves multiple packets.

While these packets cannot be seen in VPC flow logs, upon reviewing my own packet captures I can see some resolutions consisting of about 4 packets. At roughly 4 packets per resolution, the 1024 packets-per-second quota works out to only around 256 lookups per second per ENI.

Unfortunately, I can't say much about the whitepaper; at this stage, I'm not really considering the implementation of a hybrid DNS service as a "good" solution.

Solutions

I'm looking into ways to alleviate the risk of this error occurring, and to limit its impact when it does occur. As I see it, there are a number of options to achieve this:

  1. Force the Lambda functions to resolve the Aurora cluster's DNS name before doing anything else, use the resulting private IP address for the connection, and handle failures with exponential back-off (there's a sketch of this after the list). To minimise the cost of waiting for retries, I've set a total timeout of 5 seconds for DNS resolution. This number includes all back-off wait time.
  2. Making many short-lived connections comes with a potentially costly overhead, even if you're closing each connection. Consider using connection pooling on the client side (see the pooling sketch after this list), as it is a common misconception that Aurora's connection pooling is sufficient to handle the overhead of many short-lived connections.
  3. Try not to rely on DNS where possible. Aurora automatically handles failover and promotion/demotion of instances, so it's important to know that you're always connected to the "right" (or write, in some cases :P) instance. As updates to the Aurora cluster's DNS name can take time to propagate, even with its 5-second TTL, it might be better to make use of the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, in which MySQL exposes "near-real-time" metadata about DB instances (see the REPLICA_HOST_STATUS sketch after this list). Note that the table "contains cluster-wide metadata". If you can't be bothered, have a look at option 4.
  4. Use a smart driver, which:

    is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.
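
To illustrate option 1, here's a rough sketch of the kind of thing I mean; the helper name and back-off parameters are illustrative rather than the exact code from my Lambdas:

import os
import socket
import time

import mysql.connector as mysql

def resolve_with_backoff(hostname, total_timeout=5.0, base_delay=0.1):
    # Try to resolve `hostname`, waiting exponentially longer between attempts
    # and giving up once the total budget (including waits) would be exceeded.
    deadline = time.monotonic() + total_timeout
    delay = base_delay
    while True:
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if time.monotonic() + delay >= deadline:
                raise  # out of budget, surface the resolution failure
            time.sleep(delay)
            delay *= 2  # exponential back-off

# Resolve first, then connect using the private IP instead of the cluster name.
host_ip = resolve_with_backoff(os.environ['DATABASE_HOST'])
connection = mysql.connect(user=os.environ['DATABASE_USER'],
                           password=os.environ['DATABASE_PASSWORD'],
                           database=os.environ['DATABASE_NAME'],
                           host=host_ip,
                           autocommit=True)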
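
For option 2, mysql-connector ships a simple client-side pool; here's a sketch of how that could look (the pool size and handler name are just placeholders), with the pool created at module scope so warm Lambda invocations reuse connections:

import os

from mysql.connector import pooling

# Created once per container; warm invocations reuse the pooled connections.
pool = pooling.MySQLConnectionPool(pool_name='aurora',
                                   pool_size=5,  # illustrative, tune for your concurrency
                                   user=os.environ['DATABASE_USER'],
                                   password=os.environ['DATABASE_PASSWORD'],
                                   database=os.environ['DATABASE_NAME'],
                                   host=os.environ['DATABASE_HOST'],
                                   autocommit=True)

def handler(event, context):
    connection = pool.get_connection()
    try:
        cursor = connection.cursor()
        cursor.execute('SELECT 1')
        cursor.fetchall()
        cursor.close()
    finally:
        # close() on a pooled connection returns it to the pool rather than
        # tearing down the TCP session.
        connection.close()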
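
For option 3, here's a sketch of pulling the topology straight from the metadata table; the columns used (SERVER_ID and SESSION_ID, where the writer reports 'MASTER_SESSION_ID') follow the documented Aurora schema:

import os

import mysql.connector as mysql

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                           password=os.environ['DATABASE_PASSWORD'],
                           database=os.environ['DATABASE_NAME'],
                           host=os.environ['DATABASE_HOST'],
                           autocommit=True)

cursor = connection.cursor()
cursor.execute('SELECT SERVER_ID, SESSION_ID '
               'FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS')
for server_id, session_id in cursor.fetchall():
    # The writer's row carries SESSION_ID = 'MASTER_SESSION_ID'; every other
    # row is a reader replica.
    role = 'writer' if session_id == 'MASTER_SESSION_ID' else 'reader'
    print(server_id, role)
cursor.close()
connection.close()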

Not solutions

Initially, I thought it might be a good idea to create a CNAME which points to the cluster, but now I'm not so sure that caching Aurora DNS query results is wise. There are a few reasons for this, which are discussed in varying levels of detail in The Aurora Connection Management Handbook:

  • Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don’t further increase the DNS cache TTL

  • Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster

I hope this helps!
