Cassandra repair causes timeouts on one node


Question

We are running a Cassandra (3.10-1) cluster with 5 nodes, each with 8 cores and 23 GiB of memory, all in the same DC.
Replication factor: 2
Consistency level: 2

Lately, during the scheduled repair (run about once a week), there are many query timeouts from one node, and it is always the same node across different repairs. Looking at its logs:

debug.log shows several errors when building the Merkle tree for the repair:

DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,314 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in user_device_validation_configuration
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,315 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/user_device_validation_configuration, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,318 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest@5c1c2b28
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,319 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF raw_sensors
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,319 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in raw_sensors
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,319 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/raw_sensors, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest@5c1c2b28
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:142 - Got anticompaction request AnticompactionRequest{parentRepairSession=b387e320-0c8d-11e9-b894-754001cf0917} org.apache.cassandra.repair.messages.AnticompactionRequest@d4b7ed7b
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 RepairMessageVerbHandler.java:168 - Got error, removing parent repair session
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 CassandraDaemon.java:229 - Exception in thread Thread[AntiEntropyStage:1,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:171) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
Caused by: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
    at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:400) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:435) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:143) ~[apache-cassandra-3.10.jar:3.10]
    ... 7 common frames omitted
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,323 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF mouse_interactions
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,323 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in mouse_interactions
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,327 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/mouse_interactions, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [GossipStage:1] 2018-12-30 23:52:10,643 FailureDetector.java:457 - Ignoring interval time of 2000164738 for /10.52.3.47
DEBUG [GossipStage:1] 2018-12-30 23:52:13,643 FailureDetector.java:457 - Ignoring interval time of 2000239040 for /10.52.3.47
DEBUG [ReadRepairStage:407] 2018-12-30 23:52:15,133 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(7486012912397474412, 000467657474000020376337363933643363613837616231643531633936396564616234636363613400) (a0e45fcd73255bcd93a63b15d41e0843 vs 7dff1a87a755cf62150befc8352f59e8)
    at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
DEBUG [GossipStage:1] 2018-12-30 23:52:26,639 FailureDetector.java:457 - Ignoring interval time of 2000385682 for /10.52.3.47

After a few hours in the GC logs, I noticed that GC runs roughly every 20 seconds, and each run stops the application threads for more than 10 seconds:

2018-12-31T06:24:57.450+0000: 1184437.292: Total time for which application threads were stopped: 18.0318658 seconds, Stopping threads took: 0.0005000 seconds
2018-12-31T06:24:57.483+0000: 1184437.325: Total time for which application threads were stopped: 0.0053815 seconds, Stopping threads took: 0.0007325 seconds
2018-12-31T06:24:57.565+0000: 1184437.407: Total time for which application threads were stopped: 0.0118127 seconds, Stopping threads took: 0.0057652 seconds
2018-12-31T06:24:57.604+0000: 1184437.446: Total time for which application threads were stopped: 0.0064909 seconds, Stopping threads took: 0.0023037 seconds
2018-12-31T06:24:57.701+0000: 1184437.543: Total time for which application threads were stopped: 0.0066540 seconds, Stopping threads took: 0.0031299 seconds
{Heap before GC invocations=1377084 (full 108682):
 par new generation   total 943744K, used 943711K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
  eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
  from space 104832K,  99% used [0x00000005f99a0000, 0x00000005ffff7ce0, 0x0000000600000000)
  to   space 104832K,   0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
 concurrent mark-sweep generation total 7340032K, used 7340032K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
  class space    used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:24:58.029+0000: 1184437.870: [Full GC (Allocation Failure) 2018-12-31T06:24:58.029+0000: 1184437.871: [CMS: 7340032K->7340031K(7340032K), 15.2051822 secs] 8283743K->7443230K(8283776K), [Metaspace: 71629K->71629K(1116160K)], 15.2055451 secs] [Times: user=13.94 sys=1.28, real=15.20 secs] 
Heap after GC invocations=1377085 (full 108683):
 par new generation   total 943744K, used 103198K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
  eden space 838912K,  12% used [0x00000005c0000000, 0x00000005c64c7950, 0x00000005f3340000)
  from space 104832K,   0% used [0x00000005f99a0000, 0x00000005f99a0000, 0x0000000600000000)
  to   space 104832K,   0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
 concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
  class space    used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
}
2018-12-31T06:25:13.235+0000: 1184453.076: Total time for which application threads were stopped: 15.2103050 seconds, Stopping threads took: 0.0004204 seconds
2018-12-31T06:25:13.243+0000: 1184453.085: Total time for which application threads were stopped: 0.0047592 seconds, Stopping threads took: 0.0008416 seconds
2018-12-31T06:25:13.272+0000: 1184453.114: Total time for which application threads were stopped: 0.0085538 seconds, Stopping threads took: 0.0046376 seconds
2018-12-31T06:25:13.298+0000: 1184453.140: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7340031K(7340032K)] 7536074K(8283776K), 0.0650538 secs] [Times: user=0.12 sys=0.01, real=0.06 secs] 
2018-12-31T06:25:13.364+0000: 1184453.206: Total time for which application threads were stopped: 0.0728487 seconds, Stopping threads took: 0.0039520 seconds
2018-12-31T06:25:13.364+0000: 1184453.206: [CMS-concurrent-mark-start]
{Heap before GC invocations=1377085 (full 108684):
 par new generation   total 943744K, used 943215K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
  eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
  from space 104832K,  99% used [0x00000005f99a0000, 0x00000005fff7bd98, 0x0000000600000000)
  to   space 104832K,   0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
 concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 71631K, capacity 73844K, committed 75000K, reserved 1116160K
  class space    used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:25:13.821+0000: 1184453.662: [Full GC (Allocation Failure) 2018-12-31T06:25:13.821+0000: 1184453.663: [CMS2018-12-31T06:25:16.960+0000: 1184456.802: [CMS-concurrent-mark: 3.592/3.596 secs] [Times: user=12.47 sys=0.38, real=3.60 secs]
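To quantify how much wall time those safepoints actually cost, the "application threads were stopped" lines can be summed. A minimal sketch (a hypothetical helper, not part of any Cassandra tooling), run here over three of the lines above:

```python
import re

# Three representative safepoint lines copied from the GC log above
log = """\
2018-12-31T06:24:57.450+0000: 1184437.292: Total time for which application threads were stopped: 18.0318658 seconds, Stopping threads took: 0.0005000 seconds
2018-12-31T06:24:57.483+0000: 1184437.325: Total time for which application threads were stopped: 0.0053815 seconds, Stopping threads took: 0.0007325 seconds
2018-12-31T06:25:13.235+0000: 1184453.076: Total time for which application threads were stopped: 15.2103050 seconds, Stopping threads took: 0.0004204 seconds
"""

# Pull the pause duration out of each "threads were stopped" line
PAUSE = re.compile(r"stopped: ([0-9.]+) seconds,")
pauses = [float(m.group(1)) for m in PAUSE.finditer(log)]

# Anything over ~1s is long enough to trip typical client-side timeouts
long_pauses = [p for p in pauses if p > 1.0]
print(f"total stopped: {sum(pauses):.2f}s, pauses over 1s: {len(long_pauses)}")
```

Even this small sample shows over 33 seconds of stop-the-world time in a short window, which by itself explains client timeouts.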

So I started checking the data distribution in the cluster. We are using num_tokens: 32, and the data seems balanced, with roughly 40% on each node:

UN  10.52.2.94  672.64 GiB  32           ?       ad3d1365-bbb7-4229-b586-40667ec22b41  rack1
UN  10.52.3.47  640.91 GiB  32           ?       cdba952b-9685-4769-aaf4-22e538a5c37f  rack1
UN  10.52.1.57  719.34 GiB  32           ?       13bb7573-eb30-489b-80c4-6e5a7c8e5f5e  rack1
UN  10.52.5.42  743.04 GiB  32           ?       c9e892c6-9281-4a49-b4c4-a147a387c3de  rack1
UN  10.52.4.43  691.1 GiB  32           ?       53e3724e-f5a9-46b1-b330-7bb542f15f90  rack1

After checking the logs on the other nodes, I can't find any reason for those timeouts on that specific node. Any thoughts or ideas as to what causes this to happen on the same node again and again?

Answer

That's really odd to only see this on one node. Double-check that the configs are the same. Otherwise, you might be writing to or querying a large partition which that node is primarily responsible for.
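One way to check the large-partition theory is to look at the partition-size percentiles for the tables that appear in the repair errors (table names taken from the logs above; run this on each node and compare):

```
nodetool tablehistograms keyspace1 raw_sensors
nodetool tablehistograms keyspace1 mouse_interactions
```

A max partition size far above the 99th percentile on the problem node would point to a hot, oversized partition.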


Replication factor - 2 Consistency level - 2

In general, repairs can cause nodes to have trouble serving requests, because building Merkle trees and streaming data are quite resource-intensive. I see two problems here:


  1. Long GC pauses.

  2. The RF/CL ratio.

Starting with #2: when you have RF=2 and you require 2 replicas to respond to your queries, you are essentially querying at CONSISTENCY_ALL. Therefore, should a node become overwhelmed and short on resources, your queries will be unable to complete. If it were me, I would increase the RF to 3 and run a repair (assuming the nodes have adequate storage). The other option would be to query at a consistency level of ONE, which is probably what you should be doing with RF=2 anyway.

Also, when diagnosing query problems, it usually helps to see the actual query being run (which you have not provided). With Cassandra, more often than not, query issues are the result of queries that don't fit the designed table.

As for the long GC pauses, that's a trickier problem to solve. There's an old Cassandra JIRA ticket (CASSANDRA-8150) which talks about optimal settings for the CMS garbage collector. You should give it a read.

What is your MAX_HEAP set to? I see your new generation is less than 1 GB, which is going to be too small. With 23 GB of RAM on each node, I'd start with the following settings for the CMS GC:


  • Max Heap Size (Xmx) of 8GB-12GB (you want to leave about 50% of RAM for the OS/page cache).
  • Initial Heap Size (Xms) equal to the Max Heap Size, to avoid the performance hit of resizing the heap.
  • Heap New Size (Xmn) of 40-50% of your Max Heap Size (so somewhere between 3GB and 6GB). You want plenty of room for the new generation.
  • MaxTenuringThreshold of 6 to 8, as you want objects to be passed around the new-gen survivor spaces until they die, instead of being promoted to the old gen. By default, I think this is set to 1, which means objects get promoted too quickly.
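In a Cassandra 3.x jvm.options file, the settings above would look something like this (the specific sizes are illustrative picks within the ranges suggested; tune for your workload):

```
# CMS settings sketch for a 23 GiB node
-Xms12G
-Xmx12G
-Xmn5G
-XX:+UseConcMarkSweepGC
-XX:MaxTenuringThreshold=6
```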

Basically, new-to-old (and old-to-permanent) promotion is where the long pauses happen. Ideally, you'd like all objects on your heap to be created, live, and die in the new gen, without ever being promoted.

Otherwise, it might be worth your while to try G1GC. For G1, I'd go with a Max/Initial Heap of about 50% of RAM, a MaxGCPauseMillis of 400-500ms, and not set Heap New Size at all (G1 computes that itself).
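For comparison, a G1 jvm.options sketch under the same assumptions (again, illustrative values):

```
# G1 settings sketch for a 23 GiB node
-Xms12G
-Xmx12G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
# Deliberately no -Xmn: fixing the young-gen size would
# defeat G1's pause-time-driven sizing.
```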
