Hazelcast - OperationTimeoutException

Question

I am using Hazelcast version 3.3.1.
I have a 9-node cluster running on AWS using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
The distributed executor service uses a single thread. The distributed map is configured with no replication and no near-cache, and stores about 1 million objects of 1-2 KB each using a Kryo serializer.
My use case is as follows:

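  • All 9 nodes constantly perform synchronous remote operations on the distributed executor service, generating about 20k hits per second in total (about 2k per node).
  • The calls are made through the Hazelcast API com.hazelcast.core.IExecutorService#executeOnKeyOwner.
  • Each operation accesses the distributed map on the node that owns the partition, does some calculation using the stored object, and then stores the object back into the map. (For this I use the get and set APIs of the IMap object.)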

Every once in a while, Hazelcast throws a timeout exception such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0

In some cases I see map partitions start to migrate, which makes things even worse: nodes constantly leave and rejoin the cluster, and the only way I can recover is by restarting the entire cluster.

What could cause Hazelcast to block a map get operation for 120 seconds?
I am pretty sure it is not network-related, since other services on the same servers operate just fine. Also note that the servers are mostly idle (~70%).

Any feedback on my use case will be highly appreciated.

Recommended answer

Why don't you make use of an entry processor? It is also sent to the machine that owns the partition, and the load, modify, store sequence is done automatically and atomically, so there are no race problems. It will probably outperform the current approach significantly, since there is less remoting involved.
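To illustrate the "load, modify, store done atomically" point, here is a minimal local analogy that runs on the JDK alone. It is not Hazelcast code: a ConcurrentHashMap stands in for the distributed map, and the class and method names (AtomicUpdateSketch, getThenSet, atomicUpdate, runConcurrently) are made up for this sketch. In Hazelcast 3.x the corresponding call would be IMap#executeOnKey with an EntryProcessor, which runs the update on the partition owner.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Local analogy only: a ConcurrentHashMap stands in for the distributed map.
class AtomicUpdateSketch {

    // Pattern from the question: read, compute, and write back as separate steps.
    // Two threads can read the same old value, so an update may be lost.
    static void getThenSet(ConcurrentMap<String, Integer> map, String key) {
        Integer current = map.get(key);
        map.put(key, current == null ? 1 : current + 1);
    }

    // Entry-processor style: the whole read-modify-write runs as one atomic step.
    static void atomicUpdate(ConcurrentMap<String, Integer> map, String key) {
        map.merge(key, 1, Integer::sum);
    }

    // Applies the chosen update style from several threads and returns the final counter.
    static int runConcurrently(boolean atomic, int threads, int updatesPerThread) {
        ConcurrentMap<String, Integer> map = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    if (atomic) {
                        atomicUpdate(map, "k");
                    } else {
                        getThenSet(map, "k");
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return map.getOrDefault("k", 0);
    }

    public static void main(String[] args) {
        System.out.println("atomic:       " + runConcurrently(true, 4, 10_000));
        System.out.println("get-then-set: " + runConcurrently(false, 4, 10_000));
    }
}
```

The atomic variant always ends at threads × updatesPerThread, while the get-then-set variant may come out lower when updates interleave. In the distributed case the atomic variant also needs one remote call per update instead of a task submission plus separate get and set operations.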

The fact that map.get does not return for 120 seconds is indeed very confusing. In Hazelcast 3.5 we added logging and debugging support for this: the slow operation detector (on the executing side) and the slow invocation detector (on the caller side). If you switch to 3.5, these should give you some insight into what is happening.
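As a rough sketch of how the slow operation detector might be tuned after upgrading, the snippet below sets Hazelcast system properties before the instance is created. The property names are given as I recall them from the Hazelcast 3.5 reference manual and should be treated as assumptions to verify against your version; the class name, helper methods, and the 5000 ms threshold are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: property names are assumptions to check against the 3.5 reference manual.
class SlowOperationDetectorSettings {

    // Builds the diagnostic properties for a given threshold (hypothetical helper).
    static Map<String, String> diagnosticProperties(long thresholdMillis) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hazelcast.slow.operation.detector.enabled", "true");
        props.put("hazelcast.slow.operation.detector.threshold.millis",
                Long.toString(thresholdMillis));
        // Logging stack traces shows where a slow operation is stuck.
        props.put("hazelcast.slow.operation.detector.stacktrace.logging.enabled", "true");
        return props;
    }

    // Copies the properties into JVM system properties.
    // This has to happen before the Hazelcast instance is started.
    static void apply(Map<String, String> props) {
        props.forEach(System::setProperty);
    }

    public static void main(String[] args) {
        apply(diagnosticProperties(5000));
        System.getProperties().stringPropertyNames().stream()
                .filter(name -> name.startsWith("hazelcast.slow."))
                .sorted()
                .forEach(name -> System.out.println(name + "=" + System.getProperty(name)));
    }
}
```

The same values could equally be passed as -D flags on the JVM command line.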

Do you see any Health monitor logs being printed?
