Cassandra Cluster - Specific Node - Specific Table High Dropped Mutations

Problem description

My compression strategy in production was LZ4, but I changed it to Deflate.

To apply the compression change, we had to run nodetool upgradesstables to force the new compression strategy onto all SSTables.
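For reference, a change of this kind could be sketched as follows. This is only an illustration under assumptions: the keyspace and table names (`my_ks`, `my_table`) are hypothetical, and the `-a` flag is assumed because `nodetool upgradesstables` normally skips SSTables that are already on the current format version. The helper only builds the statement and command; it does not run anything.

```python
def compression_change_steps(keyspace, table, compressor="DeflateCompressor"):
    """Build the CQL statement and nodetool command for a compression change.

    keyspace/table are hypothetical placeholders supplied by the caller.
    Returns (cql_statement, nodetool_argv); nothing is executed here.
    """
    # Cassandra 3.x selects the compressor through the 'class' option.
    cql = (
        f"ALTER TABLE {keyspace}.{table} "
        f"WITH compression = {{'class': '{compressor}'}};"
    )
    # -a (--include-all-sstables) asks for a rewrite of every SSTable,
    # including those already on the current format version.
    nodetool_argv = ["nodetool", "upgradesstables", "-a", keyspace, table]
    return cql, nodetool_argv


if __name__ == "__main__":
    cql, argv = compression_change_steps("my_ks", "my_table")
    print(cql)
    print(" ".join(argv))
```

Rewriting every SSTable is I/O and CPU heavy, so running it on one node at a time is usually the more cautious choice.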

But once the upgradesstables command completed on all 5 nodes in the cluster, my requests started to fail, both reads and writes.

The issue is traced to a specific node out of the 5-node cluster, and to a specific table on that node. The whole cluster has roughly the same amount of data and the same configuration, but one node in particular is misbehaving.

nodetool status

|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  xx.xxx.xx.xxx  283.94 GiB  256          40.4%             24950207-5fbc-4ea6-92aa-d09f37e83a1c  rack1
UN  xx.xxx.xx.xxx  280.55 GiB  256          39.9%             4ecdf7f8-a4d8-4a94-a930-1a87a80ae510  rack1
UN  xx.xxx.xx.xxx  284.61 GiB  256          40.5%             de2ada08-264b-421a-961f-5fd113f28208  rack1
UN  YY.YYY.YY.YYY  280.44 GiB  256          40.2%             68c7c130-6cf8-4864-bde8-1819f238045c  rack2
UN  xx.xxx.xx.xxx  273.71 GiB  256          39.0%             6c080e47-ffb2-4fbc-bc7e-73df19103d2a  rack2

The node YY.YYY.YY.YYY above is the one with the errors.
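The claim that all nodes hold roughly the same amount of data can be checked from the status output itself. The following is a minimal sketch, assuming the row layout shown above; the function names and the "spread" metric (max minus min, divided by the mean) are illustrative choices, not part of nodetool.

```python
import re

# One node row of `nodetool status`, in the column layout shown above.
ROW = re.compile(
    r"^(?P<state>[UD][NLJM])\s+(?P<address>\S+)\s+"
    r"(?P<load>[\d.]+)\s+(?P<unit>[KMGT]iB)\s+"
    r"(?P<tokens>\d+)\s+(?P<owns>[\d.]+)%\s+"
    r"(?P<host_id>[0-9a-f-]{36})\s+(?P<rack>\S+)$"
)

UNIT_TO_GIB = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0, "TiB": 1024.0}

SAMPLE_STATUS = """\
UN  xx.xxx.xx.xxx  283.94 GiB  256          40.4%             24950207-5fbc-4ea6-92aa-d09f37e83a1c  rack1
UN  xx.xxx.xx.xxx  280.55 GiB  256          39.9%             4ecdf7f8-a4d8-4a94-a930-1a87a80ae510  rack1
UN  xx.xxx.xx.xxx  284.61 GiB  256          40.5%             de2ada08-264b-421a-961f-5fd113f28208  rack1
UN  YY.YYY.YY.YYY  280.44 GiB  256          40.2%             68c7c130-6cf8-4864-bde8-1819f238045c  rack2
UN  xx.xxx.xx.xxx  273.71 GiB  256          39.0%             6c080e47-ffb2-4fbc-bc7e-73df19103d2a  rack2
"""


def parse_status(text):
    """Return one dict per node row; header and legend lines are skipped."""
    nodes = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            nodes.append({
                "state": m["state"],
                "address": m["address"],
                "load_gib": float(m["load"]) * UNIT_TO_GIB[m["unit"]],
                "host_id": m["host_id"],
                "rack": m["rack"],
            })
    return nodes


def load_spread(nodes):
    """Relative load spread across nodes: (max - min) / mean."""
    loads = [n["load_gib"] for n in nodes]
    mean = sum(loads) / len(loads)
    return (max(loads) - min(loads)) / mean


if __name__ == "__main__":
    nodes = parse_status(SAMPLE_STATUS)
    print(f"{len(nodes)} nodes, load spread {load_spread(nodes):.1%}")
```

On the figures above the spread comes out under 4%, which is consistent with the statement that data volume alone does not distinguish the faulty node.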

Cluster configuration

  • Replication Factor -> 2
  • Read Consistency -> 1
  • Write Consistency -> 1
  • FYI, I am also using lightweight transactions. Cassandra version: 3.10

nodetool tablestats shows high dropped mutations:

                          SSTable count: 11
                          Space used (live): 9.82 GiB
                          Space used (total): 9.82 GiB
                          Space used by snapshots (total): 0 bytes
                          Off heap memory used (total): 26.77 MiB
                          SSTable Compression Ratio: 0.1840953951763564
                          Number of keys (estimate): 15448921
                          Memtable cell count: 8558
                          Memtable data size: 5.89 MiB
                          Memtable off heap memory used: 0 bytes
                          Memtable switch count: 5
                          Local read count: 67792
                          Local read latency: 92.314 ms
                          Local write count: 31336
                          Local write latency: 0.067 ms
                          Pending flushes: 0
                          Percent repaired: 21.18
                          Bloom filter false positives: 1
                          Bloom filter false ratio: 0.00794
                          Bloom filter space used: 22.2 MiB
                          Bloom filter off heap memory used: 18.45 MiB
                          Index summary off heap memory used: 3.24 MiB
                          Compression metadata off heap memory used: 5.08 MiB
                          Compacted partition minimum bytes: 87
                          Compacted partition maximum bytes: 943127
                          Compacted partition mean bytes: 3058
                          Average live cells per slice (last five minutes): 1.0
                          Maximum live cells per slice (last five minutes): 1
                          Average tombstones per slice (last five minutes): 1.0
                          Maximum tombstones per slice (last five minutes): 1
                          Dropped Mutations: 4.13 KiB
          

nodetool info shows:

          Gossip active          : true
          Thrift active          : false
          Native Transport active: true
          Load                   : 280.43 GiB
          Generation No          : 1514537104
          Uptime (seconds)       : 8810363
          Heap Memory (MB)       : 1252.06 / 3970.00
          Off Heap Memory (MB)   : 573.33
          Data Center            : dc1
          Rack                   : rack1
          Exceptions             : 18987
          Key Cache              : entries 351612, size 99.86 MiB, capacity 100 MiB, 11144584 hits, 21126425 requests, 0.528 recent hit rate, 14400 save period in seconds
          

Out of the 5 nodes, one specific node has a high number of dropped mutations (around 560 KB) and reads, even though that node has the same configuration as the others and owns an equal amount of data.

We tried to repair that node, but that did not bring down the dropped mutations, and requests kept failing.

We restarted the Cassandra service on that node, but the dropped mutations still kept increasing.

System.logs

          ERROR [ReadRepairStage:10229] 2018-04-11 16:02:12,954 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10229,5,main]
          org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
              at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
              at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
          ERROR [ReadRepairStage:10231] 2018-04-11 16:02:17,551 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10231,5,main]
          org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
              at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
              at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
          ERROR [ReadRepairStage:10232] 2018-04-11 16:02:22,221 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10232,5,main]
          org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
              at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
              at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
          

Debug.Logs

          DEBUG [ReadRepairStage:161301] 2018-04-11 01:45:01,432 DataResolver.java:169 - Timeout while read-repairing after receiving all 1 data and digest responses
          ERROR [ReadRepairStage:161301] 2018-04-11 01:45:01,432 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:161301,5,main]
          org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
              at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
              at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
          DEBUG [ReadRepairStage:161304] 2018-04-11 01:45:02,692 ReadCallback.java:242 - Digest mismatch:
          org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-4042387324575455696, 229229902e5a43588d52466b8063b557) (d41d8cd98f00b204e9800998ecf8427e vs 4662dce3dcb05114ed670fbc40291d53)
              at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_144]
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
              at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
          DEBUG [GossipStage:1] 2018-04-11 01:45:02,958 FailureDetector.java:457 - Ignoring interval time of 2000158817 for /xx.xxx.xx.xxx
          WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-04-11 01:45:04,665 NoSpamLogger.java:94 - Out of 1 commit log syncs over the past 0.00s with average duration of 180655.05ms, 1 have exceeded the configured commit interval by an average of 170655.05ms
          DEBUG [ReadRepairStage:161303] 2018-04-11 01:45:04,693 DataResolver.java:169 - Timeout while read-repairing after receiving all 1 data and digest responses
          ERROR [ReadRepairStage:161303] 2018-04-11 01:45:04,709 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:161303,5,main]
          org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
              at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
              at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
          INFO  [ScheduledTasks:1] 2018-04-11 01:45:07,353 MessagingService.java:1214 - MUTATION messages were dropped in last 5000 ms: 87 internal and 77 cross node. Mean internal dropped latency: 89509 ms and Mean cross-node dropped latency: 95871 ms
          INFO  [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - HINT messages were dropped in last 5000 ms: 0 internal and 93 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 86440 ms
          INFO  [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - READ_REPAIR messages were dropped in last 5000 ms: 0 internal and 72 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 73159 ms
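The periodic "messages were dropped" summaries above are the most direct measure of how badly the node is falling behind. A minimal sketch for extracting them from a log, assuming the message wording shown in this Cassandra 3.10 output (other versions may phrase it differently); the function and field names are illustrative:

```python
import re

# Matches the MessagingService summary lines in the wording shown above.
DROPPED = re.compile(
    r"(?P<verb>[A-Z_]+) messages were dropped in last (?P<window_ms>\d+) ms: "
    r"(?P<internal>\d+) internal and (?P<cross_node>\d+) cross node\. "
    r"Mean internal dropped latency: (?P<internal_latency_ms>\d+) ms and "
    r"Mean cross-node dropped latency: (?P<cross_latency_ms>\d+) ms"
)

SAMPLE_LOG = """\
INFO  [ScheduledTasks:1] 2018-04-11 01:45:07,353 MessagingService.java:1214 - MUTATION messages were dropped in last 5000 ms: 87 internal and 77 cross node. Mean internal dropped latency: 89509 ms and Mean cross-node dropped latency: 95871 ms
INFO  [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - HINT messages were dropped in last 5000 ms: 0 internal and 93 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 86440 ms
INFO  [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - READ_REPAIR messages were dropped in last 5000 ms: 0 internal and 72 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 73159 ms
"""


def parse_dropped(log_text):
    """Return one dict per dropped-message summary found in the log text."""
    rows = []
    for m in DROPPED.finditer(log_text):
        # Every captured field except the verb is an integer.
        row = {k: (v if k == "verb" else int(v)) for k, v in m.groupdict().items()}
        row["total"] = row["internal"] + row["cross_node"]
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in parse_dropped(SAMPLE_LOG):
        worst = max(row["internal_latency_ms"], row["cross_latency_ms"])
        print(f'{row["verb"]}: {row["total"]} dropped, worst mean latency {worst} ms')
```

In this excerpt the mean latencies of dropped messages are in the range of 70 to 95 seconds, which suggests the node was stalled for long stretches rather than being briefly overloaded.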
          

I hope someone can help me with this.

Update:

nodetool info after increasing the heap size to 9 GB on this node:

          ID                     : 68c7c130-6cf8-4864-bde8-1819f238045c
          Gossip active          : true
          Thrift active          : false
          Native Transport active: true
          Load                   : 279.32 GiB
          Generation No          : 1523504294
          Uptime (seconds)       : 9918
          Heap Memory (MB)       : 5856.73 / 9136.00
          Off Heap Memory (MB)   : 569.67
          Data Center            : dc1
          Rack                   : rack2
          Exceptions             : 862
          Key Cache              : entries 3650, size 294.83 KiB, capacity 100 MiB, 8112 hits, 22015 requests, 0.368 recent hit rate, 14400 save period in seconds
          Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
          Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
          Chunk Cache            : entries 7680, size 480 MiB, capacity 480 MiB, 1282773 misses, 1292444 requests, 0.007 recent hit rate, 3797.874 microseconds miss latency
          Percent Repaired       : 6.190888093280888%
          Token                  : (invoke with -T/--tokens to see all 256 tokens)
          

Recommended answer

We faced this issue ourselves and resolved it, as a last resort, by removing the node from the cluster. We believed there was some unknown hardware failure or memory leak of that sort.

We recommend removing the node with nodetool removenode instead of nodetool decommission, because we do not want to stream data from the failed node but from one of its replicas. This was a safety check to avoid the possibility of streaming corrupt data to other nodes.

After we removed the node, the cluster health returned to normal and it functioned normally.
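The choice between the two commands can be sketched as below. This is an illustration, not a procedure from the answer: the function name and the `trust_node_data` flag are hypothetical, and the host ID is the one reported for the faulty node in the question. It relies on two facts about nodetool: decommission is run on the leaving node, which streams its own data out, while removenode is run from another live node against the Host ID of a node that is already down, so the data is rebuilt from the remaining replicas.

```python
def removal_plan(host_id, trust_node_data):
    """Return where to run the removal and which nodetool commands to use.

    trust_node_data is a hypothetical flag: True if the leaving node's
    own data is considered safe to stream to the rest of the cluster.
    Nothing is executed here; the commands are returned as argv lists.
    """
    if trust_node_data:
        # The leaving node streams its own ranges to the new owners.
        return {
            "run_on": "the leaving node",
            "commands": [["nodetool", "decommission"]],
        }
    # The suspect node must be stopped first; the remaining replicas
    # stream the data, so nothing is read from the suspect node.
    return {
        "run_on": "any other live node",
        "commands": [
            ["nodetool", "removenode", host_id],
            ["nodetool", "removenode", "status"],  # check progress
        ],
    }


if __name__ == "__main__":
    plan = removal_plan("68c7c130-6cf8-4864-bde8-1819f238045c", trust_node_data=False)
    print("Run on:", plan["run_on"])
    for argv in plan["commands"]:
        print(" ".join(argv))
```

With a replication factor of 2, as in the question, only one other replica exists for each range, so it is worth confirming that the other nodes are healthy before removing a node this way.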
