JBoss 4.2.2 nodes start to cluster then suspect each other


Question

I have a website running with JBoss 4.2.2 on an existing Red Hat server. I'm setting up a second server so as to have a clustered pair (which will then be load-balanced). However, I can't get them to cluster successfully.

The existing server starts up JBoss with:

run.sh -c default -b 0.0.0.0

(I know the 'default' configuration doesn't support clustering out of the box - I'm using a modified version of it which includes clustering support.) When I start the second JBoss instance with the same command, it forms its own cluster without noticing the first. Both use the same partition name and multicast address and port.

I tried the McastReceiverTest and McastSenderTest programs to check that the machines could communicate over multicast; they could.
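For reference, those are the multicast diagnostic classes that ship with JGroups. A sketch of the invocation, assuming the jar location in a stock JBoss 4.2 install and using the multicast address and port that appear later in these logs (228.1.2.3:45566) — adjust both to your setup:

```shell
# On one host, start the receiver:
java -cp server/default/lib/jgroups.jar \
     org.jgroups.tests.McastReceiverTest -mcast_addr 228.1.2.3 -port 45566

# On the other host, start the sender; lines typed at its prompt should
# show up on the receiver if multicast traffic is getting through:
java -cp server/default/lib/jgroups.jar \
     org.jgroups.tests.McastSenderTest -mcast_addr 228.1.2.3 -port 45566
```

Running the test in both directions (swapping sender and receiver between the two hosts) rules out asymmetric multicast routing.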

I then noticed the info at http://docs.jboss.org/jbossas/docs/Clustering_Guide/beta422/html/ch07s07s07.html, saying that JGroups cannot bind to all interfaces, and instead binds to the default interface; so presumably it was binding to 127.0.0.1, and thereby not getting the messages through. So instead I set the instances to tell JGroups to use the internal IPs:

run.sh -c default -b 0.0.0.0 -Djgroups.bind_addr=10.51.1.131
run.sh -c default -b 0.0.0.0 -Djgroups.bind_addr=10.51.1.141

(.131 is the existing server, .141 is the new server).

The nodes now notice each other and form a cluster - at first. However, while trying to deploy the .ear, the server log says this:

2010-08-07 22:26:39,321 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:45,412 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48733; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:46294 (number=0)
2010-08-07 22:26:49,529 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=10.51.1.141:60365, coord_addr=10.51.1.141:60365, is_server=true]]
2010-08-07 22:26:52,092 WARN  [org.jboss.cache.TreeCache] replication failure with method_call optimisticPrepare; id:18; Args: ( arg[0] = GlobalTransaction:<10.51.1.131:46294>:5421085 ...) exception org.jboss.cache.lock.TimeoutException: failure acquiring lock: fqn=/Yudu_ear,Yudu-ejb_jar,Yudu-ejbPU/com/yudu/ejb/entity, caller=GlobalTransaction:<10.51.1.131:46294>:5421085, lock=read owners=[GlobalTransaction:<10.51.1.131:46294>:5421081] (activeReaders=1, activeWriter=null, waitingReaders=0, waitingWriters=1, waitingUpgrader=0)

...and the .ear fails to deploy.

If I change CacheMode in ejb3-entity-cache-service.xml from REPL_SYNC to LOCAL, the .ear deploys correctly, although of course the entity cache replication then doesn't happen. However, the log still shows interesting signs of the same problem.
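For context, the setting in question is an attribute on the entity cache's TreeCache MBean. A sketch of the relevant fragment, using the MBean name from a stock JBoss 4.2 `deploy/ejb3-entity-cache-service.xml` (check your own file, since this configuration is modified):

```xml
<mbean code="org.jboss.cache.TreeCache"
       name="jboss.cache:service=EJB3EntityTreeCache">
   <!-- REPL_SYNC replicates entity state synchronously across the
        cluster; LOCAL keeps the cache node-local (no replication). -->
   <attribute name="CacheMode">REPL_SYNC</attribute>
   <!-- other attributes unchanged -->
</mbean>
```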

It looks like:

  • at first the new node finds the existing one and forms a cluster
  • then the FD checks fail, and after a certain number of failures the new node splits off from the cluster and forms one of its own
  • then it finds the existing node again, re-clusters, and this time the FD checks work

The relevant bits of the log file:

2010-08-07 23:47:07,423 INFO  [org.jgroups.protocols.UDP] socket information: local_addr=10.51.1.141:35666, mcast_addr=228.1.2.3:45566, bind_addr=/10.51.1.141, ttl=2 sock: bound to 10.51.1.141:35666, receive buffer size=131071, send buffer size=131071 mcast_recv_sock: bound to 0.0.0.0:45566, send buffer size=131071, receive buffer size=131071 mcast_send_sock: bound to 10.51.1.141:59196, send buffer size=131071, receive buffer size=131071
2010-08-07 23:47:07,431 DEBUG [org.jgroups.protocols.UDP] created unicast receiver thread
2010-08-07 23:47:09,445 DEBUG [org.jgroups.protocols.pbcast.GMS] initial_mbrs are [[own_addr=10.51.1.131:48888, coord_addr=10.51.1.131:48888, is_server=true]]
2010-08-07 23:47:09,446 DEBUG [org.jgroups.protocols.pbcast.GMS] election results: {10.51.1.131:48888=1}
2010-08-07 23:47:09,446 DEBUG [org.jgroups.protocols.pbcast.GMS] sending handleJoin(10.51.1.141:35666) to 10.51.1.131:48888
2010-08-07 23:47:09,751 DEBUG [org.jgroups.protocols.pbcast.GMS] [10.51.1.141:35666]: JoinRsp=[10.51.1.131:48888|61] [10.51.1.131:48888, 10.51.1.141:35666] [size=2]
2010-08-07 23:47:09,752 DEBUG [org.jgroups.protocols.pbcast.GMS] new_view=[10.51.1.131:48888|61] [10.51.1.131:48888, 10.51.1.141:35666]
...
2010-08-07 23:47:10,047 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Number of cluster members: 2
2010-08-07 23:47:10,047 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Other members: 1
...
2010-08-07 23:47:20,034 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:30,037 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:30,038 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:48888 (number=0)
2010-08-07 23:47:40,040 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:40,040 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:48888 (number=1)
...
2010-08-07 23:48:19,758 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:48:20,054 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:48:20,055 DEBUG [org.jgroups.protocols.FD] [10.51.1.141:35666]: received no heartbeat ack from 10.51.1.131:48888 for 6 times (60000 milliseconds), suspecting it
2010-08-07 23:48:20,058 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[10.51.1.131:48888]] to group
...
2010-08-07 23:48:21,691 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing 10.51.1.131:48888 from received_msgs (not member anymore)
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (127.0.0.1:1099) received membershipChanged event:
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 0 ([])
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
2010-08-07 23:48:21,691 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([127.0.0.1:1099])
...
2010-08-07 23:49:59,793 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:50:09,796 WARN  [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:50:19,144 DEBUG [org.jgroups.protocols.FD] Recevied Ack. is invalid (was from: 10.51.1.131:48888),
2010-08-07 23:50:19,144 DEBUG [org.jgroups.protocols.FD] Recevied Ack. is invalid (was from: 10.51.1.131:48888),
...
2010-08-07 23:50:21,791 DEBUG [org.jgroups.protocols.pbcast.GMS] new=[10.51.1.131:48902], suspected=[], leaving=[], new view: [10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
...
2010-08-07 23:50:21,792 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
2010-08-07 23:50:21,792 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.51.1.141:35666] view is [10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
2010-08-07 23:50:21,822 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 63, delta: 1) : [127.0.0.1:1099, 127.0.0.1:1099]
2010-08-07 23:50:21,822 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] membership changed from 1 to 2
...
2010-08-07 23:50:31,825 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48902 (own address=10.51.1.141:35666)
2010-08-07 23:50:31,832 DEBUG [org.jgroups.protocols.FD] received ack from 10.51.1.131:48902

But I'm at a loss to understand why the FD checks fail the first time round; and although it eventually seems to cluster with the other node, the initial failure seems to be enough to mess up the deployment when it tries to share entity state, and thereby prevent it from actually working in a useful way.
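For anyone digging into the failure-detection side: the FD parameters live in the JGroups protocol stack inside the partition configuration (e.g. `deploy/cluster-service.xml` in a stock 4.2 layout). A sketch of the relevant fragment — the values shown are the common 4.2 defaults, consistent with the "no heartbeat ack ... (60000 milliseconds)" message above, but illustrative only:

```xml
<!-- Inside the <Config> element of the ClusterPartition MBean -->
<FD_SOCK down_thread="false" up_thread="false"/>
<!-- A member is suspected after max_tries missed heartbeats,
     each waited on for `timeout` ms. -->
<FD timeout="10000" max_tries="5" shun="true"
    up_thread="false" down_thread="false"/>
```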

If anyone can shed light on this I'll be hugely grateful!

Answer

I think that before you move on to JBoss 4.2.3 (which is probably a good place to be eventually) or building a new configuration (I agree with @skaffman about pruning being easier than adding), you might want to try the following:

On 10.51.1.131:

run.sh -c default -b 10.51.1.131 -Djgroups.bind_addr=10.51.1.131

On 10.51.1.141:

run.sh -c default -b 10.51.1.141 -Djgroups.bind_addr=10.51.1.141

According to all the documentation I can find on this, the -b parameter is the server instance bind address, and having them be different might be creating some significant schizophrenia for JGroups. I had a four-server clustered environment working successfully for over three years, and that was part of the recommended configuration from RH/JBoss (we had a support contract, and got help from Bela Ban).
