JBoss 4.2.2 节点开始集群然后互相猜疑 [英] JBoss 4.2.2 nodes start to cluster then suspect each other

查看:37
本文介绍了JBoss 4.2.2 节点开始集群然后互相猜疑的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个在现有 Red Hat 服务器上运行 JBoss 4.2.2 的网站.我正在设置第二台服务器,以便拥有一个集群对(然后将进行负载平衡).但是,我无法让它们成功集群.

I have a website running with JBoss 4.2.2 on an existing Red Hat server. I'm setting up a second server so as to have a clustered pair (which will then be load-balanced). However, I can't get them to cluster successfully.

现有服务器使用以下命令启动 JBoss:

The existing server starts up JBoss with:

run.sh -c default -b 0.0.0.0

(我知道默认"配置不支持开箱即用的集群 - 我正在使用它的修改版本,其中包括集群支持.)当我使用相同的命令启动第二个 JBoss 实例时,它会形成自己的集群,而不会注意到第一个.两者都使用相同的分区名称和多播地址和端口.

(I know the 'default' configuration doesn't support clustering out of the box - I'm using a modified version of it which includes clustering support.) When I start the second JBoss instance with the same command, it forms its own cluster without noticing the first. Both use the same partition name and multicast address and port.

我尝试了 McastReceiverTest 和 McastSenderTest 程序来检查机器是否可以通过多播进行通信;他们可以.

I tried the McastReceiverTest and McastSenderTest programs to check that the machines could communicate over multicast; they could.

然后我注意到 http://docs.jboss.org/jbossas/docs/Clustering_Guide/beta422/html/ch07s07s07.html,说JGroups不能绑定所有接口,而是绑定到默认接口;所以大概它绑定到 127.0.0.1,从而无法通过消息.因此,我将实例设置为告诉 JGroups 使用内部 IP:

I then noticed the info at http://docs.jboss.org/jbossas/docs/Clustering_Guide/beta422/html/ch07s07s07.html, saying that JGroups cannot bind to all interfaces, and instead binds to the default interface; so presumably it was binding to 127.0.0.1, and thereby not getting the messages through. So instead I set the instances to tell JGroups to use the internal IPs:

run.sh -c default -b 0.0.0.0 -Djgroups.bind_addr=10.51.1.131
run.sh -c default -b 0.0.0.0 -Djgroups.bind_addr=10.51.1.141

(.131 是现有服务器,.141 是新服务器).

(.131 is the existing server, .141 is the new server).

节点现在相互注意并形成一个集群 - 首先是.但是,在尝试部署 .ear 时,服务器日志显示:

The nodes now notice each other and form a cluster - at first. However, while trying to deploy the .ear, the server log says this:

2010-08-07 22:26:39,321 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:45,412 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48733; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:46294 (number=0)
2010-08-07 22:26:49,529 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=10.51.1.141:60365, coord_addr=10.51.1.141:60365, is_server=true]]
2010-08-07 22:26:52,092 WARN  [org.jboss.cache.TreeCache] replication failure with method_call optimisticPrepare; id:18; Args: ( arg[0] = GlobalTransaction:<10.51.1.131:46294>:5421085 ...) exception org.jboss.cache.lock.TimeoutException: failure acquiring lock: fqn=/Yudu_ear,Yudu-ejb_jar,Yudu-ejbPU/com/yudu/ejb/entity, caller=GlobalTransaction:<10.51.1.131:46294>:5421085, lock=read owners=[GlobalTransaction:<10.51.1.131:46294>:5421081] (activeReaders=1, activeWriter=null, waitingReaders=0, waitingWriters=1, waitingUpgrader=0)

...并且 .ear 无法部署.

...and the .ear fails to deploy.

如果我将 ejb3-entity-cache-service.xml 中的 CacheMode 从 REPL_SYNC 更改为 LOCAL,.ear 将正确部署,当然实体缓存复制不会发生.但是,日志仍然显示出相同问题的有趣迹象.

If I change CacheMode in ejb3-entity-cache-service.xml from REPL_SYNC to LOCAL, the .ear deploys correctly, although of course the entity cache replication then doesn't happen. However, the log still shows interesting signs of the same problem.

看起来像:

  • 首先新节点找到现有节点并形成一个集群
  • 然后FD检查失败,在一定次数的失败后,新节点从集群中分离出来并形成自己的集群
  • 然后它再次找到它,重新集群,这次 FD 检查工作.

日志文件的相关部分:

2010-08-07 23:47:07,423 INFO  [org.jgroups.protocols.UDP] socket information: local_addr=10.51.1.141:35666, mcast_addr=228.1.2.3:45566, bind_addr=/10.51.1.141, ttl=2 sock: bound to 10.51.1.141:35666, receive buffer size=131071, send buffer size=131071 mcast_recv_sock: bound to 0.0.0.0:45566, send buffer size=131071, receive buffer size=131071 mcast_send_sock: bound to 10.51.1.141:59196, send buffer size=131071, receive buffer size=131071
2010-08-07 23:47:07,431 DEBUG [org.jgroups.protocols.UDP] created unicast receiver thread
2010-08-07 23:47:09,445 DEBUG [org.jgroups.protocols.pbcast.GMS] initial_mbrs are [[own_addr=10.51.1.131:48888, coord_addr=10.51.1.131:48888, is_server=true]]
2010-08-07 23:47:09,446 DEBUG [org.jgroups.protocols.pbcast.GMS] election results: {10.51.1.131:48888=1}
2010-08-07 23:47:09,446 DEBUG [org.jgroups.protocols.pbcast.GMS] sending handleJoin(10.51.1.141:35666) to 10.51.1.131:48888
2010-08-07 23:47:09,751 DEBUG [org.jgroups.protocols.pbcast.GMS] [10.51.1.141:35666]: JoinRsp=[10.51.1.131:48888|61] [10.51.1.131:48888, 10.51.1.141:35666] [size=2]
2010-08-07 23:47:09,752 DEBUG [org.jgroups.protocols.pbcast.GMS] new_view=[10.51.1.131:48888|61] [10.51.1.131:48888, 10.51.1.141:35666]
...
2010-08-07 23:47:10,047 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Number of cluster members: 2
2010-08-07 23:47:10,047 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Other members: 1
...
2010-08-07 23:47:20,034 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:30,037 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:30,038 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:48888 (number=0)
2010-08-07 23:47:40,040 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:40,040 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:48888 (number=1)
...
2010-08-07 23:48:19,758 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:48:20,054 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:48:20,055 DEBUG [org.jgroups.protocols.FD] [10.51.1.141:35666]: received no heartbeat ack from 10.51.1.131:48888 for 6 times (60000 milliseconds), suspecting it
2010-08-07 23:48:20,058 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[10.51.1.131:48888]] to group
...
2010-08-07 23:48:21,691 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing 10.51.1.131:48888 from received_msgs (not member anymore)
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (127.0.0.1:1099) received membershipChanged event:
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 0 ([])
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([127.0.0.1:1099])
...
2010-08-07 23:49:59,793 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:50:09,796 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:50:19,144 DEBUG [org.jgroups.protocols.FD] Recevied Ack. is invalid (was from: 10.51.1.131:48888),
2010-08-07 23:50:19,144 DEBUG [org.jgroups.protocols.FD] Recevied Ack. is invalid (was from: 10.51.1.131:48888),
...
2010-08-07 23:50:21,791 DEBUG [org.jgroups.protocols.pbcast.GMS] new=[10.51.1.131:48902], suspected=[], leaving=[], new view: [10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
...
2010-08-07 23:50:21,792 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
2010-08-07 23:50:21,792 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.51.1.141:35666] view is [10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
2010-08-07 23:50:21,822 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 63, delta: 1) : [127.0.0.1:1099, 127.0.0.1:1099]
2010-08-07 23:50:21,822 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] membership changed from 1 to 2
...
2010-08-07 23:50:31,825 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48902 (own address=10.51.1.141:35666)
2010-08-07 23:50:31,832 DEBUG [org.jgroups.protocols.FD] received ack from 10.51.1.131:48902

但我不明白为什么 FD 检查在第一轮失败;尽管它最终似乎与另一个节点形成集群,但最初的失败似乎足以在它尝试共享实体状态时弄乱部署,从而阻止它以有用的方式实际工作.

But I'm at a loss to understand why the FD checks fail the first time round; and although it eventually seems to cluster with the other node, the initial failure seems to be enough to mess up the deployment when it tries to share entity state, and thereby prevent it from actually working in a useful way.

如果有人能对此有所了解,我将不胜感激!

If anyone can shed light on this I'll be hugely grateful!

推荐答案

我认为在您转向 JBoss 4.2.3(这可能是最终的好地方)或构建新配置之前(我同意@skaffman 关于修剪比添加更容易),您可能想尝试以下操作:

I think that before you move on to JBoss 4.2.3 (which is probably a good place to be eventually) or building a new configuration (I agree with @skaffman about pruning being easier than adding), you might want to try the following:

在 10.51.1.131:

On 10.51.1.131:

run.sh -c default -b 10.51.1.131 -Djgroups.bind_addr=10.51.1.131

在 10.51.1.141:

On 10.51.1.141:

run.sh -c default -b 10.51.1.141 -Djgroups.bind_addr=10.51.1.141

根据我能找到的所有文档,-b 参数是服务器实例绑定地址,让它们不同可能会给 JGroups 造成一些严重的精神分裂症.我有一个四服务器集群环境成功运行了三年多,这是 RH/JBoss 推荐配置的一部分(我们有支持合同,并得到了 Bela Ban 的帮助).

According to all the documentation I can find on this, the -b parameter is the server instance bind address, and having them be different might be creating some significant schizophrenia for JGroups. I had a four-server clustered environment working successfully for over three years, and that was part of the recommended configuration from RH/JBoss (we had a support contract, and got help from Bela Ban).

这篇关于JBoss 4.2.2 节点开始集群然后互相猜疑的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆