Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster

Problem description

I am running an Amazon Web Services RDS Aurora 5.6 database cluster. A couple of Lambda functions talk to these database instances, all written in Python. Everything was running well, but a couple of days ago the Python code suddenly started throwing the following error from time to time:

[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)

This happens in roughly 1 out of every 1000 new connections. Interestingly, I haven't touched this service at all in the last couple of days (since it started happening). All Lambdas use the official MySQL connector client and connect on every initialization with the following snippet:

import mysql.connector as mysql
import os

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                           password=os.environ['DATABASE_PASSWORD'],
                           database=os.environ['DATABASE_NAME'],
                           host=os.environ['DATABASE_HOST'],
                           autocommit=True)

To rule out a problem in the Python MySQL client, I added the following to resolve the host:

import os
import socket

host = socket.gethostbyname(os.environ['DATABASE_HOST'])

Here, too, I sometimes get the following error:

[ERROR] gaierror: [Errno -2] Name or service not known

I suspect this has something to do with DNS, but since I'm just using the cluster endpoint there is not much I can do about that. Interestingly, I recently encountered exactly the same problem in a different region with the same setup (Aurora 5.6 cluster, Lambdas in Python connecting to it).

I've tried restarting all the machines in the cluster, but the problem still occurs. Is this really a DNS issue? What can I do to stop this from happening?

Recommended answer

AWS Support have told me that this error is likely to be caused by a traffic quota in AWS's VPCs.

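According to their documentation on DNS quotas:

Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Amazon VPC whitepaper.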

It's important to note that the metric we're looking at here is packets per second, per ENI. Why does this matter? It may not be immediately obvious, but although the actual number of packets per query varies, a single DNS query typically involves multiple packets.

While these packets cannot be seen in VPC flow logs, my own packet captures show some resolutions consisting of about 4 packets.
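To put the quota into perspective, here is a back-of-the-envelope estimate. The figure of 4 packets per resolution is only the rough number from the captures mentioned above, so treat the result as an order of magnitude rather than a hard limit.

```python
# Rough budget for DNS lookups per network interface (ENI).
PACKETS_PER_SECOND_QUOTA = 1024  # per-ENI quota from the AWS documentation
PACKETS_PER_RESOLUTION = 4       # assumption: approximate value seen in packet captures


def max_resolutions_per_second(quota=PACKETS_PER_SECOND_QUOTA,
                               packets_per_resolution=PACKETS_PER_RESOLUTION):
    """Return how many complete DNS resolutions fit into the packet quota."""
    return quota // packets_per_resolution


print(max_resolutions_per_second())  # 256
```

Under these assumptions, a burst of a few hundred new connections per second through one ENI would be enough to hit the quota.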

Unfortunately, I can't say much about the whitepaper; at this stage, I don't really consider implementing a hybrid DNS service a "good" solution.

I'm looking into ways to reduce the risk of this error occurring, and to limit its impact when it does occur. As I see it, there are a number of options to achieve this:

  1. Force Lambda functions to resolve the Aurora cluster's DNS name before doing anything else, use the private IP address for the connection, and handle failures with an exponential back-off. To minimise the cost of waiting for retries, I've set a total timeout of 5 seconds for DNS resolution. This number includes all back-off wait time.
  2. Making many short-lived connections comes with a potentially costly overhead, even if you're closing the connection. Consider using connection pooling on the client side, as it is a common misconception that Aurora's connection pooling is sufficient to handle the overhead of many short-lived connections.
  3. Try not to rely on DNS where possible. Aurora automatically handles failover and promotion/demotion of instances, so it's important to know that you're always connected to the "right" (or write, in some cases :P) instance. As updates to the Aurora cluster's DNS name can take time to propagate, even with its 5-second TTL, it might be better to make use of the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, in which MySQL exposes "in near-real-time" metadata about DB instances. Note that the table "contains cluster-wide metadata". If you'd rather not do this yourself, have a look at option 4.
  4. Use a smart driver, which:

is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.
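Option 1 can be sketched roughly as follows. This is a minimal illustration, not the exact code I run: the function name resolve_with_backoff, the 0.1 second initial delay, and the injectable resolver, sleep, and clock parameters are all assumptions made for the example. Only the 5 second total budget comes from the description above.

```python
import socket
import time


def resolve_with_backoff(hostname, total_timeout=5.0, first_delay=0.1,
                         resolver=socket.gethostbyname, sleep=time.sleep,
                         clock=time.monotonic):
    """Resolve hostname to an IPv4 address, retrying with exponential back-off.

    Gives up (re-raising the last error) once the next wait would exceed
    total_timeout seconds overall, so the budget includes all back-off waits.
    """
    deadline = clock() + total_timeout
    delay = first_delay
    while True:
        try:
            return resolver(hostname)
        except socket.gaierror:
            # Stop if waiting again would push us past the overall budget.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay *= 2  # double the wait before the next attempt
```

The resulting address would then be passed as host to mysql.connect, for example host=resolve_with_backoff(os.environ['DATABASE_HOST']). Note that the budget only limits the back-off waits; a single lookup that blocks inside the resolver can still take longer. The resolved IP should also not be kept around for longer than the 5 second TTL, or you risk talking to the wrong instance after a failover.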

Not solutions

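Initially, I thought it might be a good idea to create a CNAME that points to the cluster, but now I'm not so sure that caching Aurora DNS query results is wise. There are a few reasons for this, which are discussed in the Aurora Connection Management Handbook: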

Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL.

  • Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster.
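If you do bypass the endpoints and read the metadata table yourself (option 3 above), you take over the role tracking that the endpoints would otherwise abstract away. Here is a minimal sketch of that step. The column names server_id and session_id and the 'MASTER_SESSION_ID' marker for the writer are what I recall from the Aurora MySQL documentation, so verify them against your own cluster; split_topology is a hypothetical helper name.

```python
# Query against the metadata table; column names are an assumption to verify.
TOPOLOGY_QUERY = (
    "SELECT server_id, session_id "
    "FROM information_schema.replica_host_status"
)


def split_topology(rows):
    """Split (server_id, session_id) rows into (writer_id, [reader_ids])."""
    writer = None
    readers = []
    for server_id, session_id in rows:
        # Assumption: the writer instance is flagged with this session id.
        if session_id == "MASTER_SESSION_ID":
            writer = server_id
        else:
            readers.append(server_id)
    return writer, readers
```

With an open connection this could be used as cursor.execute(TOPOLOGY_QUERY) followed by writer, readers = split_topology(cursor.fetchall()). Turning a server ID into an instance endpoint and filtering out stale rows are left out here; a real smart driver handles both.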

I hope this helps!
