Configure Hazelcast CPSubsystem Retry Timeout
Question
Currently I have three instances registered in the CPSubsystem.
 -----
| I1* |            * Leader
 -----
 ----     ----
| I2 |   | I3 |
 ----     ----
When all instances are up and running, all registered and seeing each other on the CPSubsystem, everything works as expected. The following call is used to perform distributed locks between all the instances:
getHazelcastInstance().getCpSubsystem().getLock(lockDefinition.getLockEntryName())
I noticed an issue when two of these instances die and there is no leader, nor any other instance available to perform a leader election:
 -----
| XXX |            * DEAD
 -----
 ----     -----
| I2 |   | XXX |   * DEAD
 ----     -----
The running instance then tries to acquire a distributed lock, but the request freezes inside the getLock method, causing requests to queue for minutes (hence the need to configure a timeout for when the instance becomes the only one left in the subsystem).
I also noticed the following log being printed forever:
2019-08-16 10:56:21.697 WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:23.737 WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 106
2019-08-16 10:56:23.927 WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:26.006 WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1468, invocationTime=1565963786004 (2019-08-16 10:56:26.004), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=130, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.core.MemberLeftException: Member [127.0.0.1]:5702 - ab45ea09-c8c9-4f03-b3db-42b7b440d161 this has left cluster!
2019-08-16 10:56:26.232 WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 107
2019-08-16 10:56:26.413 WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:27.143 WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1479, invocationTime=1565963787142 (2019-08-16 10:56:27.142), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=140, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5703, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, partitionId: 81, operation: com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp, service: hz:core:raft
2019-08-16 10:56:28.835 WARN 1337 --- [cached.thread-6] c.h.nio.tcp.TcpIpConnectionErrorHandler : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 108
2019-08-16 10:56:28.941 WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.038 WARN 1337 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionErrorHandler : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 109
2019-08-16 10:56:31.533 WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.555 WARN 1337 --- [.async.thread-3] c.h.s.i.o.impl.Invocation : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1493, invocationTime=1565963791554 (2019-08-16 10:56:31.554), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=150, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.cp.exception.NotLeaderException: CPMember{uuid=ab45ea09-c8c9-4f03-b3db-42b7b440d161, address=[127.0.0.1]:5702} is not LEADER of CPGroupId{name='default', seed=0, commitIndex=6}. Known leader is: N/A
Is there a way to identify that the instance is now running alone, and if so, not block the application while acquiring a new lock?
I was looking for some mechanism to avoid blocking the flow of the application in any way; if the application is running alone I would use a regular j.u.c.l.ReentrantLock instead of the FencedLock.
Answer
After a few days of testing, I came to the following conclusions:
- Although the CPSubsystem demands at least three members to start working, it is fine to have only two instances running
- In the most catastrophic scenario I presented (just one instance running), there is not much to do: your environment is probably having a rough time, and some kind of intervention or attention will be needed to resolve the outage
To keep all operations consistent between modules, I decided to prevent the request from being fulfilled when this scenario happens.
This decision was made reading a lot of material (here, here, here, here, here and also simulating the scenario over here).
So the approach looks like this:
try {
    if (!hz.isCpInstanceAvailable()) {
        throw new HazelcastUnavailableException("CPSubsystem is not available");
    }
    // ... acquires the lock ...
} catch (HazelcastUnavailableException e) {
    LOG.error("Error retrieving Hazelcast Distributed Lock :( Please check the CPSubsystem health among all instances", e);
    throw e;
}
The method isCpInstanceAvailable performs three validations:
- whether the current application is registered on the CPSubsystem
- whether the CPSubsystem is up
- whether a minimum number of members is available in the CPSubsystem
Here is the solution:
protected boolean isCpInstanceAvailable() {
    try {
        return getCPLocalMember() != null
                && getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS).size() > ONE_MEMBER;
    } catch (InterruptedException | ExecutionException | TimeoutException e) {
        LOG.error("Issue retrieving CP Members", e);
    }
    return false;
}

protected ICompletableFuture<Collection<CPMember>> getCPMembers() {
    return Optional.ofNullable(getCPSubsystemManagementService().getCPMembers()).orElseThrow(
            () -> new HazelcastUnavailableException("CP Members not available"));
}

protected CPMember getCPLocalMember() {
    return getCPSubsystemManagementService().getLocalCPMember();
}
Here comes the issue: simply calling getCPMembers().get() would cause the long pause I was experiencing (default timeout).
So I used getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS), which throws an error if the call exceeds the expected timeout.