Cassandra error message: Not marking nodes down due to local pause. Why?
Problem description
I have a 6-node DataStax cluster: 1 Solr node and 5 Spark nodes. The cluster runs on servers similar to Amazon EC2, with EBS volumes. Each node has 3 EBS volumes, combined into one logical data disk using LVM. In OpsCenter, the same node frequently becomes unresponsive, which leads to connection timeouts in my data system. My data is around 400 GB with 3 replicas. I have 20 streaming jobs with a one-minute batch interval. Here is my error message:
/var/log/cassandra/output.log:WARN 13:44:31,868 Not marking nodes down due to local pause of 53690474502 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:40:34,944 FailureDetector.java:258 - Not marking nodes down due to local pause of 64532052919 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:59:12,023 FailureDetector.java:258 - Not marking nodes down due to local pause of 66027485893 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-26 13:44:31,868 FailureDetector.java:258 - Not marking nodes down due to local pause of 53690474502 > 5000000000
These are my more specific configurations. I would like to know whether I am doing something wrong and, if so, how I can find out in detail what it is and how to fix it.
The heap is set to:
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4G"
Current heap:
[root@iZ11xsiompxZ ~]# jstat -gc 11399
S0C  S1C       S0U  S1U       EC         EU         OC          OU          MC       MU       CCSC  CCSU  YGC   YGCT     FGC  FGCT   GCT
0.0  196608.0  0.0  196608.0  6717440.0  2015232.0  43417600.0  23029174.0  69604.0  68678.2  0.0   0.0   1041  131.437  0    0.000  131.437
[root@iZ11xsiompxZ ~]# jmap -heap 11399
Attaching to process ID 11399, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.102-b14
using thread-local object allocation.
Garbage-First (G1) GC with 23 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 51539607552 (49152.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 30920409088 (29488.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 16777216 (16.0MB)
Heap Usage:
G1 Heap:
regions = 3072
capacity = 51539607552 (49152.0MB)
used = 29923661848 (28537.427757263184MB)
free = 21615945704 (20614.572242736816MB)
58.059545404588185% used
G1 Young Generation:
Eden Space:
regions = 366
capacity = 6878658560 (6560.0MB)
used = 6140461056 (5856.0MB)
free = 738197504 (704.0MB)
89.26829268292683% used
Survivor Space:
regions = 12
capacity = 201326592 (192.0MB)
used = 201326592 (192.0MB)
free = 0 (0.0MB)
100.0% used
G1 Old Generation:
regions = 1443
capacity = 44459622400 (42400.0MB)
used = 23581874200 (22489.427757263184MB)
free = 20877748200 (19910.572242736816MB)
53.04110320109241% used
40076 interned Strings occupying 7467880 bytes.
I don't know why this happens. Thanks a lot.
Recommended answer
The message you see, Not marking nodes down due to local pause, is due to the JVM pausing. Although you're doing some good things here by posting JVM information, a good place to start is often just /var/log/cassandra/system.log: for example, check for ERROR and WARN entries. Also check the length and frequency of GC events by grepping for GCInspector.
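To put the numbers in the warning into perspective, here is a minimal Python sketch that extracts them from the log lines quoted in the question and converts them to seconds. It assumes both values are nanoseconds, so the threshold 5000000000 is 5 s; the names PAUSE_RE and local_pauses are made up for this illustration.

```python
import re

# Matches the FailureDetector warning quoted in the question. Both numbers are
# assumed to be nanoseconds (so the threshold 5000000000 would be 5 seconds).
PAUSE_RE = re.compile(
    r"Not marking nodes down due to local pause of (\d+) > (\d+)"
)


def local_pauses(lines):
    """Yield (pause_seconds, threshold_seconds) for each matching log line."""
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pause_ns, threshold_ns = (int(g) for g in match.groups())
            yield pause_ns / 1e9, threshold_ns / 1e9


# Log lines copied from the question (file name prefixes dropped).
SAMPLE = [
    "WARN [GossipTasks:1] 2016-09-25 16:40:34,944 FailureDetector.java:258 - "
    "Not marking nodes down due to local pause of 64532052919 > 5000000000",
    "WARN [GossipTasks:1] 2016-09-25 16:59:12,023 FailureDetector.java:258 - "
    "Not marking nodes down due to local pause of 66027485893 > 5000000000",
    "WARN [GossipTasks:1] 2016-09-26 13:44:31,868 FailureDetector.java:258 - "
    "Not marking nodes down due to local pause of 53690474502 > 5000000000",
]

pauses = list(local_pauses(SAMPLE))
for pause_s, threshold_s in pauses:
    print(f"paused {pause_s:.1f} s (threshold {threshold_s:.0f} s)")
```

Under that assumption, the three warnings correspond to pauses of roughly 54 to 66 seconds against a 5-second threshold, which is why the node looks unresponsive from the outside. To run it against a real log, pass open("/var/log/cassandra/system.log") instead of SAMPLE.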
Tools such as nodetool tpstats are your friend here: they show whether you have backed-up or dropped mutations, blocked flush writers, and so on.
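As a rough illustration of what to look for in that output, the sketch below filters tpstats text down to thread pools with pending or blocked tasks and message types with drops. The column layout (Pool Name, Active, Pending, Completed, Blocked, All time blocked, then a Message type / Dropped section) is an assumption that may differ between Cassandra versions, and the SAMPLE text is invented for the example, not output from the asker's cluster.

```python
import subprocess


def run_tpstats():
    """Return the text printed by `nodetool tpstats` (needs nodetool on PATH)."""
    result = subprocess.run(
        ["nodetool", "tpstats"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_tpstats(text):
    """Return (pools with pending/blocked tasks, message types with drops)."""
    blocked, dropped = {}, {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "dropped"
        elif section == "pools" and len(fields) >= 6:
            # Assumed columns: name, Active, Pending, Completed, Blocked,
            # All time blocked.
            if fields[2].isdigit() and fields[5].isdigit():
                pending, all_time_blocked = int(fields[2]), int(fields[5])
                if pending or all_time_blocked:
                    blocked[fields[0]] = (pending, all_time_blocked)
        elif section == "dropped" and len(fields) >= 2:
            if fields[1].isdigit() and int(fields[1]) > 0:
                dropped[fields[0]] = int(fields[1])
    return blocked, dropped


# Hypothetical output, made up to exercise the parser.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        1200000         0                 0
MemtableFlushWriter               1        12           3400         0                57

Message type           Dropped
READ                         0
MUTATION                  1523
"""

blocked, dropped = parse_tpstats(SAMPLE)
print("pools with pending/blocked tasks:", blocked)
print("dropped message types:", dropped)
```

On a live node, parse_tpstats(run_tpstats()) would apply the same filter to the real output. A non-empty result for the flush writer pool or for MUTATION is the kind of sign the answer refers to.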
The docs here list some good things to check for: https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/cassandraTrblTOC.html
Also check that your nodes have the recommended production settings; this is often overlooked:
http://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettingsLinux.html
One more thing to note: Cassandra is rather I/O sensitive, and "normal" EBS might not be fast enough for what you need here. Add Solr to the mix and you can see a lot of I/O contention when a Cassandra compaction and a Lucene merge hit the disk at the same time.