Configure Hazelcast CPSubsystem Retries Timeout

Question

Currently I have three instances registered in the CPSubsystem.

      ----- 
     | I1* | * Leader
      ----- 

 ----       ---- 
| I2 |     | I3 |
 ----       ---- 

When all instances are up and running, registered, and visible to each other in the CPSubsystem, everything works as expected. The following call is used to acquire distributed locks across the instances:

getHazelcastInstance().getCpSubsystem().getLock(lockDefinition.getLockEntryName())

I noticed an issue when two of these instances die, leaving no leader and no other instance available to take part in a leader election:

      ----- 
     | XXX | * DEAD
      ----- 

 ----       ----- 
| I2 |     | XXX | * DEAD
 ----       ----- 

The running instance then tries to acquire a distributed lock, but the request freezes inside the getLock method, causing requests to queue for minutes. I need a way to configure the timeout that applies when an instance becomes the only one left in the subsystem.

I have also noticed the following log being printed indefinitely:

2019-08-16 10:56:21.697  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:23.737  WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 106
2019-08-16 10:56:23.927  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:26.006  WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1468, invocationTime=1565963786004 (2019-08-16 10:56:26.004), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=130, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.core.MemberLeftException: Member [127.0.0.1]:5702 - ab45ea09-c8c9-4f03-b3db-42b7b440d161 this has left cluster!
2019-08-16 10:56:26.232  WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 107
2019-08-16 10:56:26.413  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:27.143  WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1479, invocationTime=1565963787142 (2019-08-16 10:56:27.142), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=140, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5703, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, partitionId: 81, operation: com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp, service: hz:core:raft
2019-08-16 10:56:28.835  WARN 1337 --- [cached.thread-6] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 108
2019-08-16 10:56:28.941  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.038  WARN 1337 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 109
2019-08-16 10:56:31.533  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.555  WARN 1337 --- [.async.thread-3] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1493, invocationTime=1565963791554 (2019-08-16 10:56:31.554), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=150, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.cp.exception.NotLeaderException: CPMember{uuid=ab45ea09-c8c9-4f03-b3db-42b7b440d161, address=[127.0.0.1]:5702} is not LEADER of CPGroupId{name='default', seed=0, commitIndex=6}. Known leader is: N/A

Is there a way to detect that the instance is now running alone and, if so, avoid blocking the application while it acquires a new lock?

I am looking for a mechanism that never blocks the flow of the application. If the application is running alone, I would even use a regular j.u.c.l.ReentrantLock instead of the FencedLock.
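The fallback idea described above could be sketched as follows. This is only an illustration under stated assumptions: `LockLookup`, `lookupOrFallback`, and the `Supplier<Lock>` standing in for the blocking CP lock lookup are hypothetical names, not Hazelcast API, and the timeout value is arbitrary. It relies on the fact that FencedLock implements `java.util.concurrent.locks.Lock`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical helper: bounds a blocking lock lookup and falls back to a local lock. */
final class LockLookup {
    private LockLookup() {}

    static Lock lookupOrFallback(Supplier<Lock> distributedLookup, long timeoutMillis,
                                 Lock localFallback, ExecutorService executor) {
        // Run the (possibly blocking) lookup on another thread so the caller can give up.
        Callable<Lock> task = distributedLookup::get;
        Future<Lock> pending = executor.submit(task);
        try {
            Lock lock = pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return lock != null ? lock : localFallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        } catch (ExecutionException | TimeoutException e) {
            // Lookup failed or took too long: use the fallback below.
        }
        pending.cancel(true); // try to interrupt the stuck lookup
        return localFallback;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Lock local = new ReentrantLock(); // shared local lock, created once
        // Stand-in for a lookup that never returns while no leader is available.
        Supplier<Lock> stuck = () -> {
            try {
                new CountDownLatch(1).await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return null;
        };
        Lock lock = lookupOrFallback(stuck, 200, local, executor);
        executor.shutdownNow();
        System.out.println(lock == local); // prints true
    }
}
```

A local lock only protects threads inside one JVM, so this trades cluster-wide consistency for availability. A lookup that ignores interruption would also keep occupying an executor thread.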

Recommended answer

After a few days of testing, I came to the following conclusions:

  1. Although the CPSubsystem requires at least three members to start working, it is fine to have only two instances running afterwards.
  2. In the most catastrophic scenario I presented (just one instance running), there is not much to do. Your environment is probably having a rough time, and some kind of intervention or attention will be needed to resolve the outage.
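The two conclusions above follow from majority arithmetic: a Raft-based group can elect a leader and commit operations only while more than half of its members are reachable. A minimal sketch of that arithmetic, where `CpQuorum` and its methods are illustrative names and not part of the Hazelcast API:

```java
/** Majority arithmetic for a Raft-style group; illustrative only, not Hazelcast API. */
final class CpQuorum {
    private CpQuorum() {}

    /** Smallest number of members that forms a majority of groupSize. */
    static int majority(int groupSize) {
        if (groupSize < 1) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        return groupSize / 2 + 1;
    }

    /** True if the reachable members can still elect a leader and commit operations. */
    static boolean canMakeProgress(int groupSize, int reachableMembers) {
        return reachableMembers >= majority(groupSize);
    }

    public static void main(String[] args) {
        System.out.println(canMakeProgress(3, 2)); // two of three alive: prints true
        System.out.println(canMakeProgress(3, 1)); // one of three alive: prints false
    }
}
```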

To keep all operations consistent between modules, I decided to reject the request when this scenario happens.

This decision was made after reading a lot of material (here, here, here, here, here) and after simulating the scenario (here).

So, the approach is as follows:

try {
    // Fail fast instead of blocking when the CPSubsystem cannot serve the request
    if (!hz.isCpInstanceAvailable()) {
        throw new HazelcastUnavailableException("CPSubsystem is not available");
    }
    // ... acquires the lock ...
} catch (HazelcastUnavailableException e) {
    LOG.error("Error retrieving Hazelcast Distributed Lock :( Please check the CPSubsystem health among all instances", e);
    throw e;
}

The method isCpInstanceAvailable performs three validations:

  1. Whether the current application is registered in the CPSubsystem
  2. Whether the CPSubsystem is up
  3. Whether a minimum number of members is available in the CPSubsystem

Here is the solution:

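protected boolean isCpInstanceAvailable() {
    try {
        // Available only if this member is a CP member and more than one CP member
        // is reported within the validation timeout
        return getCPLocalMember() != null
                && getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS).size() > ONE_MEMBER;
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status for callers
        LOG.error("Interrupted while retrieving CP Members", e);
    } catch (ExecutionException | TimeoutException e) {
        LOG.error("Issue retrieving CP Members", e);
    }

    return false;
}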

protected ICompletableFuture<Collection<CPMember>> getCPMembers() {
    return Optional.ofNullable(getCPSubsystemManagementService().getCPMembers()).orElseThrow(
            () -> new HazelcastUnavailableException("CP Members not available"));
}

protected CPMember getCPLocalMember() {
    return getCPSubsystemManagementService().getLocalCPMember();
}

Here is the catch: simply calling getCPMembers().get() caused the long pause I was experiencing, because it waits for the default timeout.

So I used getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS), which throws a TimeoutException if the call exceeds the given timeout.
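The difference between the unbounded and the bounded `get` can be reproduced with plain `java.util.concurrent` futures, without a cluster. The sketch below is illustrative: `BoundedFutures` and `getWithin` are hypothetical names, and a `CompletableFuture` that never completes stands in for the CP members call when no leader is available.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper showing a bounded wait on a Future. */
final class BoundedFutures {
    private BoundedFutures() {}

    /** Waits at most timeoutSeconds; returns empty if the result is late, failed, or the wait is interrupted. */
    static <T> Optional<T> getWithin(Future<T> future, long timeoutSeconds) {
        try {
            return Optional.ofNullable(future.get(timeoutSeconds, TimeUnit.SECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Optional.empty();
        } catch (ExecutionException | TimeoutException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Future<String> never = new CompletableFuture<>(); // never completes
        Future<String> ready = CompletableFuture.completedFuture("3 members");
        System.out.println(getWithin(never, 1).isPresent());           // prints false after about 1 second
        System.out.println(getWithin(ready, 1).orElse("unavailable")); // prints 3 members
    }
}
```

Calling `never.get()` without a timeout would block the calling thread indefinitely, which mirrors the pause described above.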
