Node doesn't rejoin cluster after being downed


Question

I'm using Akka.NET's cluster (1.0.5) functionality to implement a service that consists of a master node that receives requests over HTTP and farms the work out to worker nodes that have joined the cluster.

The idea is to be able to easily accomplish the following:

  • add worker nodes to the cluster when demand is high (check)
  • be able to reboot the master node or take it offline (maintenance/misbehaviour/whatever) and have the workers reconnect when it becomes available (check)
  • upgrade/reboot a misbehaving worker and have it reconnect to the master node (fail!)

The first point works as you'd expect: a new instance (Azure Cloud Service worker role) is spun up, and joins the master - which is also the seed node.

For the second point, all worker nodes have an actor that listens to cluster gossip and it determines if the master node has died. If this is the case, the worker node actor system will be rebooted.
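A worker-side listener of this kind could be sketched roughly as follows. This is only an illustration under assumptions: the class name MasterWatcher and the onMasterLost callback are hypothetical and do not come from the original post, and the actual restart of the actor system is left to the hosting code.

```csharp
using System;
using Akka.Actor;
using Akka.Cluster;

// Hypothetical worker-side actor: watches cluster events and reports
// when a member carrying the master role is lost.
public class MasterWatcher : ReceiveActor
{
    // Role name taken from the cluster config in the question.
    private const string MasterRole = "InventoryServiceMaster";

    private readonly Akka.Cluster.Cluster _cluster =
        Akka.Cluster.Cluster.Get(Context.System);

    // onMasterLost is an assumed callback supplied by the host process.
    // It may be invoked more than once (unreachable, then removed),
    // so it should be idempotent.
    public MasterWatcher(Action onMasterLost)
    {
        Receive<ClusterEvent.UnreachableMember>(e =>
        {
            if (e.Member.HasRole(MasterRole)) onMasterLost();
        });
        Receive<ClusterEvent.MemberRemoved>(e =>
        {
            if (e.Member.HasRole(MasterRole)) onMasterLost();
        });
    }

    protected override void PreStart()
    {
        // Subscribe only to the two event types handled above.
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.UnreachableMember),
            typeof(ClusterEvent.MemberRemoved));
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}
```

The callback keeps the restart decision outside the actor, since an actor cannot cleanly restart the system it is running in.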

The last point is where I'm stuck. The master node also listens to cluster gossip to determine when a worker has become unreachable (ClusterEvent.UnreachableMember) or is shutting down (Exiting status) and decides if it should be downed. According to what I've understood from documentation, the only way to have a "new" version of the same node rejoin the cluster is to down the old version first.
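The master-side downing logic described here might look something like the sketch below. It is an assumption-laden illustration, not the code from the original post: the class name WorkerDowner is hypothetical, and only the unreachable case is handled (the Exiting case mentioned above is omitted for brevity).

```csharp
using Akka.Actor;
using Akka.Cluster;

// Hypothetical master-side actor: downs workers that the failure
// detector has marked as unreachable, so a restarted incarnation
// on the same address can join again.
public class WorkerDowner : ReceiveActor
{
    // Role name taken from the cluster config in the question.
    private const string WorkerRole = "InventoryServiceWorker";

    private readonly Akka.Cluster.Cluster _cluster =
        Akka.Cluster.Cluster.Get(Context.System);

    public WorkerDowner()
    {
        Receive<ClusterEvent.UnreachableMember>(e =>
        {
            var member = e.Member;

            // The master carries the worker role as well, so make sure
            // we never down our own address.
            if (member.HasRole(WorkerRole) &&
                !member.Address.Equals(_cluster.SelfAddress))
            {
                _cluster.Down(member.Address);
            }
        });
    }

    protected override void PreStart()
    {
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.UnreachableMember));
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}
```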

Unfortunately this doesn't seem to be happening. In the test scenario I ran to reproduce the problem locally in the compute emulator, these were the steps:

  1. Start the master node (port 8090)
  2. Start the worker node (port 9090)
  3. Do some work
  4. Abruptly kill the worker node
  5. Start the worker node back up

Below are relevant snippets from the logs I collected for both nodes during this test:

Master:

Worker becomes unreachable:

[WARNING][07/12/2015 20:39:35][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://InventoryService@0.0.0.0:9090, status = Up]

Master node calls Cluster.Leave() and Cluster.Down() on the worker's address:

[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Leave
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marked address [akka.tcp://InventoryService@0.0.0.0:9090] as Leaving]
[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Down
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marking unreachable node [akka.tcp://InventoryService@0.0.0.0:9090] as Down
[DEBUG][07/12/2015 20:39:35][Thread 0020][[akka://InventoryService/system/cluster/core/daemon/heartbeatSender]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Heartbeat to [akka.tcp://InventoryService@0.0.0.0:9090]
[INFO][07/12/2015 20:39:36][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Leader is removing unreachable node [akka.tcp://InventoryService@0.0.0.0:9090]

The master confirms that the old node will no longer be allowed to join. The first line seems to have a bug, though: it says "gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms", where I imagine the time it is supposed to be gated for should appear:

[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms
[DEBUG][07/12/2015 20:39:36][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%400.0.0.0%3a9090-2/endpointWriter]] Disassociated [akka.tcp://InventoryService@127.0.0.1:8090] -> akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:39:36][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

Worker boots and tries to connect to the master:

[DEBUG][07/12/2015 20:40:20][Thread 0013][remoting] Associated [akka.tcp://InventoryService@127.0.0.1:8090] <- akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:40:21][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:21][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:23][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:28][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:33][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:38][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:43][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:48][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:53][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:58][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:03][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:08][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:13][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:18][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin

What's going on here?

Worker:

Starting back up after being killed:

[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:20][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:21][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%40127.0.0.1%3a8090-1/endpointWriter]] Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000

And that's it... nothing else gets written to the log!

Complete log files:

Worker: http://pastebin.com/raw.php?i=QGPxkqEd

Master cluster config:

cluster {
    seed-nodes = ["master's address here"]
    roles = [ InventoryServiceMaster, InventoryServiceWorker ]
    failure-detector {
        acceptable-heartbeat-pause = 5s
        threshold = 10.0
    }
}

Worker's config is the same, but only has the InventoryServiceWorker role.

What am I missing here? Is this a configuration problem? (I'm hoping it's not a bug - I've seen someone else report a similar problem on GitHub.)

Just to be clear, I'm not using the Akka.dll from NuGet since it contains a serialization bug - I checked out the current master with the fix applied and did a Release build. The logs have debug information because I kept the PDB from the build.

In the worker log, after rebooting, the event Akka.Cluster.InternalClusterAction+JoinSeedNodes appears twice because I originally had a manual call to Cluster.JoinSeedNodes(). I've since removed this but the result is still the same.

Recommended answer

This has been resolved as of Akka.NET 1.1. Our UID system wasn't implemented correctly prior to that (1.0.5, at the time of this post), but it works fine now.
