RabbitMQ cluster is not reconnecting after network failure


Problem description

I have a RabbitMQ cluster with two nodes in production, and the cluster is breaking with these error messages:

=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit@rabbitmq02 not responding **
** Removing (timedout) connection **

=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit@rabbitmq02 lost 'rabbit'

=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit@rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbitmq02}

I tried to simulate the problem by killing the connection between the two nodes using "tcpkill". The cluster disconnected, and surprisingly the two nodes are not trying to reconnect!

When the cluster breaks, the HAProxy load balancer still marks both nodes as active and sends requests to both of them, although they are no longer in a cluster.

My questions:

  1. If the nodes are configured to work as a cluster, why aren't they trying to reconnect after a network failure?

  2. How can I identify a broken cluster and shut down one of the nodes? I get consistency problems when working with the two nodes separately.

Recommended answer

Another way to recover from this kind of failure is to work with Mnesia, the database that RabbitMQ uses as its persistence mechanism and through which the synchronization of the RabbitMQ instances (and their master/slave status) is controlled. For all the details, refer to the following URL: http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html
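As a quick sanity check (a minimal sketch of mine, not part of the original answer), the calls below can be run in the Erlang shell attached to a RabbitMQ node to compare the set of database nodes Mnesia knows about with the set it currently considers running; after a partition, the difference shows which peers have been lost.

%% Minimal sketch: inspect Mnesia's view of the cluster from the Erlang
%% shell of a RabbitMQ node.
AllNodes     = mnesia:system_info(db_nodes),          %% all nodes in the Mnesia schema
RunningNodes = mnesia:system_info(running_db_nodes),  %% nodes currently reachable
AllNodes -- RunningNodes.                             %% peers lost to the partition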

The relevant section is reproduced here:

There are several occasions when Mnesia may detect that the network has been partitioned due to a communication failure.

One is when Mnesia is already up and running and the Erlang nodes gain contact again. Mnesia then tries to contact Mnesia on the other node to see if it also thinks that the network has been partitioned for a while. If Mnesia on both nodes has logged mnesia_down entries from each other, Mnesia generates a system event, called {inconsistent_database, running_partitioned_network, Node}, which is sent to Mnesia's event handler and other possible subscribers. The default event handler reports an error to the error logger.
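A minimal sketch of acting on that event (my addition, not from the quoted documentation): a process can subscribe to Mnesia system events and log when an inconsistent_database event arrives, which directly addresses the "how can I identify a broken cluster" part of the question. The module name and message handling are an assumption about how one might wire this up.

-module(partition_watch).
-export([start/0]).

%% Subscribe the calling process to Mnesia system events and log
%% inconsistent_database notifications (i.e. detected partitions).
start() ->
    {ok, _Node} = mnesia:subscribe(system),
    loop().

loop() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            %% Context is running_partitioned_network or starting_partitioned_network
            error_logger:error_msg("Detected partition (~p) with ~p~n",
                                   [Context, Node]),
            loop();
        _Other ->
            loop()
    end.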

Another occasion when Mnesia may detect that the network has been partitioned due to a communication failure is at start-up. If Mnesia detects that both the local node and another node received mnesia_down from each other, it generates an {inconsistent_database, starting_partitioned_network, Node} system event and acts as described above.

If the application detects that there has been a communication failure which may have caused an inconsistent database, it may use the function mnesia:set_master_nodes(Tab, Nodes) to pinpoint from which nodes each table may be loaded.
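For example, a sketch under the assumption that rabbit@rabbitmq01 is the node whose data should win; the table name is only illustrative, so check mnesia:system_info(tables) on your node for the actual RabbitMQ table names.

%% Run on the node that should discard its own view of this table.
Winner = 'rabbit@rabbitmq01',
%% At the next start the table is loaded from Winner, ignoring the
%% mnesia_down entries left behind by the partition.
mnesia:set_master_nodes(rabbit_durable_queue, [Winner]).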

At start-up, Mnesia's normal table load algorithm will be bypassed and the table will be loaded from one of the master nodes defined for the table, regardless of potential mnesia_down entries in the log. The Nodes list may only contain nodes where the table has a replica; if it is empty, the master node recovery mechanism for that particular table will be reset and the normal load mechanism will be used at the next restart.

The function mnesia:set_master_nodes(Nodes) sets master nodes for all tables. For each table it will determine its replica nodes and invoke mnesia:set_master_nodes(Tab, TabNodes) with those replica nodes that are included in the Nodes list (i.e. TabNodes is the intersection of Nodes and the replica nodes of the table). If the intersection is empty, the master node recovery mechanism for the particular table will be reset and the normal load mechanism will be used at the next restart.
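A sketch of the all-tables variant, again assuming rabbit@rabbitmq01 is the side whose data should survive (the node names are taken from the logs above):

%% Run on the "losing" node before restarting it: every table that has a
%% replica on rabbit@rabbitmq01 will be reloaded from there at next start.
mnesia:set_master_nodes(['rabbit@rabbitmq01']).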

The functions mnesia:system_info(master_node_tables) and mnesia:table_info(Tab, master_nodes) may be used to obtain information about the potential master nodes.
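For instance, a small sketch that lists every table with a master-node override together with its configured masters:

%% Which tables have master nodes set, and what those masters are.
Tables = mnesia:system_info(master_node_tables),
[{Tab, mnesia:table_info(Tab, master_nodes)} || Tab <- Tables].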

Determining which data to keep after a communication failure is outside the scope of Mnesia. One approach would be to determine which "island" contains a majority of the nodes. Using the {majority, true} option for critical tables can be a way of ensuring that nodes that are not part of a "majority island" are not able to update those tables. Note that this constitutes a reduction in service on the minority nodes. This would be a trade-off in favour of higher consistency guarantees.
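A sketch of the option on a hypothetical application table (RabbitMQ creates its own tables itself, so this only illustrates the Mnesia feature). Note that with two replicas a "majority" means both nodes must be reachable, which is exactly the reduction in service mentioned above.

%% Writes to this table succeed only while a majority of its replicas
%% are reachable, so a minority island cannot diverge.
mnesia:create_table(critical_data,
                    [{disc_copies, ['rabbit@rabbitmq01', 'rabbit@rabbitmq02']},
                     {attributes, [key, value]},
                     {majority, true}]).
%% For an existing table the flag can be changed with
%% mnesia:change_table_majority(critical_data, true).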

The function mnesia:force_load_table(Tab) may be used to force-load the table regardless of which table load mechanism is activated.
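For completeness, a sketch of forcing a load (again with an illustrative table name); this bypasses the consistency checks entirely, so it is the bluntest of the tools described here:

%% Load the table on this node immediately, regardless of master-node
%% settings or mnesia_down entries.
mnesia:force_load_table(rabbit_durable_queue).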

This is a lengthier and more involved way of recovering from such failures, but it will give better granularity and control over the data that should be available in the final master node (which can reduce the amount of data loss that might happen when "merging" RabbitMQ masters).
