Cassandra Cluster - Specific Node - Specific Table High Dropped Mutations
Problem description
My compression strategy in production was LZ4 compression, but I changed it to Deflate.
For the compression change, we had to use nodetool upgradesstables to force the new compression strategy onto all SSTables.
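For readers following along, the compression change described above can be sketched as follows. This is a minimal sketch, not the exact commands that were run: the keyspace and table names (`my_ks`, `my_table`) are hypothetical placeholders, and the `run` helper with its `DRY_RUN` switch is an illustrative addition so the commands can be inspected before touching a cluster. It assumes the Cassandra 3.x `compression` option syntax, and that `-a` is passed to `nodetool upgradesstables` so that SSTables already on the current format version are rewritten as well.

```shell
#!/bin/sh
# Sketch: switch a table to Deflate compression and rewrite its SSTables.
# Keyspace/table names are placeholders; DRY_RUN is an illustrative helper.

run() {
  # Print the command when DRY_RUN=1, otherwise execute it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

change_compression() {
  ks="$1"
  tbl="$2"
  # Cassandra 3.x syntax: the compressor is selected with the 'class' key.
  run cqlsh -e "ALTER TABLE ${ks}.${tbl} WITH compression = {'class': 'DeflateCompressor'};" || return 1
  # -a (--include-all-sstables) also rewrites SSTables that are already on
  # the current format version, so the new compressor is applied to them.
  run nodetool upgradesstables -a "$ks" "$tbl"
}
```

Calling `DRY_RUN=1; change_compression my_ks my_table` only prints the two commands. Note that `nodetool upgradesstables` acts on the local node, so it would have to be run on each node in turn.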
But once the upgradesstables command had completed on all 5 nodes in the cluster, my requests started to fail, both reads and writes.
The issue was traced to a specific node in the 5-node cluster, and to a specific table on that node. The whole cluster has roughly the same amount of data and the same configuration on every node, but one node in particular is misbehaving.
nodetool status
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xxx.xx.xxx  283.94 GiB  256     40.4%             24950207-5fbc-4ea6-92aa-d09f37e83a1c  rack1
UN  xx.xxx.xx.xxx  280.55 GiB  256     39.9%             4ecdf7f8-a4d8-4a94-a930-1a87a80ae510  rack1
UN  xx.xxx.xx.xxx  284.61 GiB  256     40.5%             de2ada08-264b-421a-961f-5fd113f28208  rack1
UN  YY.YYY.YY.YYY  280.44 GiB  256     40.2%             68c7c130-6cf8-4864-bde8-1819f238045c  rack2
UN  xx.xxx.xx.xxx  273.71 GiB  256     39.0%             6c080e47-ffb2-4fbc-bc7e-73df19103d2a  rack2
The node shown as YY.YYY.YY.YYY above is the one with the errors.
Cluster configuration
- Replication Factor -> 2
- Read Consistency -> 1
- Write Consistency -> 1
- FYI, I am also using lightweight transactions
- Cassandra version 3.10
nodetool tablestats shows high dropped mutations:
SSTable count: 11
Space used (live): 9.82 GiB
Space used (total): 9.82 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 26.77 MiB
SSTable Compression Ratio: 0.1840953951763564
Number of keys (estimate): 15448921
Memtable cell count: 8558
Memtable data size: 5.89 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 5
Local read count: 67792
Local read latency: 92.314 ms
Local write count: 31336
Local write latency: 0.067 ms
Pending flushes: 0
Percent repaired: 21.18
Bloom filter false positives: 1
Bloom filter false ratio: 0.00794
Bloom filter space used: 22.2 MiB
Bloom filter off heap memory used: 18.45 MiB
Index summary off heap memory used: 3.24 MiB
Compression metadata off heap memory used: 5.08 MiB
Compacted partition minimum bytes: 87
Compacted partition maximum bytes: 943127
Compacted partition mean bytes: 3058
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 4.13 KiB
nodetool info shows:
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 280.43 GiB
Generation No : 1514537104
Uptime (seconds) : 8810363
Heap Memory (MB) : 1252.06 / 3970.00
Off Heap Memory (MB) : 573.33
Data Center : dc1
Rack : rack1
Exceptions : 18987
Key Cache : entries 351612, size 99.86 MiB, capacity 100 MiB, 11144584 hits, 21126425 requests, 0.528 recent hit rate, 14400 save period in seconds
Out of the 5 nodes, one specific node has a high number of dropped mutations ("around 560Kb") and reads, even though that node has the same configuration as the others and owns an equal amount of data.
We tried to repair that node, but that did not bring down the dropped mutations, and the requests kept failing.
We restarted the Cassandra service on that node, but the dropped mutations still kept increasing.
System.logs
ERROR [ReadRepairStage:10229] 2018-04-11 16:02:12,954 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10229,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
ERROR [ReadRepairStage:10231] 2018-04-11 16:02:17,551 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10231,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
ERROR [ReadRepairStage:10232] 2018-04-11 16:02:22,221 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10232,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
Debug.Logs
DEBUG [ReadRepairStage:161301] 2018-04-11 01:45:01,432 DataResolver.java:169 - Timeout while read-repairing after receiving all 1 data and digest responses
ERROR [ReadRepairStage:161301] 2018-04-11 01:45:01,432 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:161301,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
DEBUG [ReadRepairStage:161304] 2018-04-11 01:45:02,692 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-4042387324575455696, 229229902e5a43588d52466b8063b557) (d41d8cd98f00b204e9800998ecf8427e vs 4662dce3dcb05114ed670fbc40291d53)
    at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_144]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
DEBUG [GossipStage:1] 2018-04-11 01:45:02,958 FailureDetector.java:457 - Ignoring interval time of 2000158817 for /xx.xxx.xx.xxx
WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-04-11 01:45:04,665 NoSpamLogger.java:94 - Out of 1 commit log syncs over the past 0.00s with average duration of 180655.05ms, 1 have exceeded the configured commit interval by an average of 170655.05ms
DEBUG [ReadRepairStage:161303] 2018-04-11 01:45:04,693 DataResolver.java:169 - Timeout while read-repairing after receiving all 1 data and digest responses
ERROR [ReadRepairStage:161303] 2018-04-11 01:45:04,709 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:161303,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
INFO [ScheduledTasks:1] 2018-04-11 01:45:07,353 MessagingService.java:1214 - MUTATION messages were dropped in last 5000 ms: 87 internal and 77 cross node. Mean internal dropped latency: 89509 ms and Mean cross-node dropped latency: 95871 ms
INFO [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - HINT messages were dropped in last 5000 ms: 0 internal and 93 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 86440 ms
INFO [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - READ_REPAIR messages were dropped in last 5000 ms: 0 internal and 72 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 73159 ms
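The MessagingService entries in this excerpt are the ones that quantify the dropped messages. As a rough aid for comparing nodes, the following sketch extracts the message type and the internal and cross-node drop counts from such entries, assuming one log entry per line as Cassandra writes them. The function name `summarize_dropped` is made up for illustration, and the pattern is based only on the log format shown here.

```shell
#!/bin/sh
# Sketch: pull dropped-message counts out of Cassandra log lines.
# Input:  log lines on stdin (one entry per line).
# Output: "<TYPE> <internal> <cross-node>" for each dropped-message report.

summarize_dropped() {
  # Lines that do not match the dropped-message format print nothing.
  sed -n 's/.* - \([A-Z_]*\) messages were dropped in last [0-9]* ms: \([0-9]*\) internal and \([0-9]*\) cross node\..*/\1 \2 \3/p'
}
```

It could be fed a log file, for example `summarize_dropped < /var/log/cassandra/debug.log`; that path is only an assumption about a typical package install and may differ on a given system.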
I hope someone can help me with this.
Update:
nodetool info after updating the heap size to 9 GB for this node:
ID : 68c7c130-6cf8-4864-bde8-1819f238045c
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 279.32 GiB
Generation No : 1523504294
Uptime (seconds) : 9918
Heap Memory (MB) : 5856.73 / 9136.00
Off Heap Memory (MB) : 569.67
Data Center : dc1
Rack : rack2
Exceptions : 862
Key Cache : entries 3650, size 294.83 KiB, capacity 100 MiB, 8112 hits, 22015 requests, 0.368 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 7680, size 480 MiB, capacity 480 MiB, 1282773 misses, 1292444 requests, 0.007 recent hit rate, 3797.874 microseconds miss latency
Percent Repaired : 6.190888093280888%
Token : (invoke with -T/--tokens to see all 256 tokens)
Recommended answer
We faced this issue ourselves and resolved it (as a last resort) by removing the node from the cluster. (We believed there was some unknown hardware failure or a memory leak of some sort.)
We recommend you remove the node using
nodetool removenode
instead of
nodetool decommission
because we do not want to stream data from the failed node but instead from one of its replicas. (This was a safety check to avoid the possibility of streaming corrupt data to other nodes.)
After we removed the node, the cluster health returned to normal and the cluster functioned normally.
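The removal described above might look like the following sketch. `nodetool removenode` is run from a live node and takes the Host ID of the node to remove, which must already be down at that point. The `run` helper, its `DRY_RUN` switch, and the function name `remove_dead_node` are illustrative additions, not part of nodetool.

```shell
#!/bin/sh
# Sketch: remove a dead node by Host ID, run from any live node.
# Assumes Cassandra has already been stopped on the faulty node.

run() {
  # Print the command when DRY_RUN=1, otherwise execute it.
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

remove_dead_node() {
  host_id="$1"
  if [ -z "$host_id" ]; then
    echo "usage: remove_dead_node <host-id>" >&2
    return 1
  fi
  # The remaining replicas stream the removed node's ranges to each other;
  # nothing is read from the removed node itself.
  run nodetool removenode "$host_id" || return 1
  # Confirm that the node no longer appears in the ring.
  run nodetool status
}
```

For the cluster in the question, the call would be `remove_dead_node 68c7c130-6cf8-4864-bde8-1819f238045c`, using the Host ID that `nodetool status` reports for the faulty node. With a replication factor of 2, each range then has a single surviving replica to stream from, so it seems prudent to confirm the other nodes are healthy first.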