Cassandra running compact indefinitely - High CPU usage


Question

Context

We have 6 instances of Cassandra hosted on AWS, separated into 3 different regions, 2 per region (2 in eu-west, 2 in us-west, 2 in ap-southeast).

2 days ago, we moved 2 of our EC2 Cassandra instances from us-west-1 to us-east-1. When I say "move" I mean that we decommissioned them and added 2 new instances to our cluster.
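For reference, the move followed the usual decommission-and-bootstrap procedure; a minimal sketch of it is below (the cluster_name, seed address and snitch shown are placeholders/assumptions, not our actual configuration):

# On each us-west-1 node being retired: stream its ranges to the rest of the
# cluster, then stop the service
nodetool decommission
sudo service cassandra stop

# On each new us-east-1 node, before the first start, point cassandra.yaml at
# the existing cluster:
#   cluster_name: 'our-cluster'            # placeholder - must match the existing cluster
#   seeds: "10.0.0.1"                      # placeholder - IP of an existing node
#   endpoint_snitch: Ec2MultiRegionSnitch  # assumption - any multi-DC snitch works
sudo service cassandra start
nodetool status    # wait until the new nodes show up as UN (Up/Normal)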

We ran nodetool repair, which didn't appear to do anything, and nodetool rebuild, which synchronized our data from the eu-west data centre. Following that change we noticed that multiple instances in our Cassandra cluster were using over 70% CPU and were receiving inbound traffic.
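Concretely, the commands were along these lines on each new node (the data centre name comes from the nodetool status output further down; take this as a sketch rather than our exact shell history):

nodetool repair                          # completed but didn't seem to do anything
nodetool rebuild cassandra-eu-west-1-A   # streams existing data from the eu-west DC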

At first, we thought it was the replication taking place, but considering that we only have 500MB of data, and that it is still running, we are puzzled as to what is happening.


Instance hardware:

All of our instances are running on m3.medium, which means each of them has:

  • 1 CPU, 2.5 GHz
  • 3.75 GB of RAM
  • 4GB SSD

Also, we have mounted an EBS volume for /var/lib/cassandra, which is actually a RAID0 of 6 SSD volumes on EBS (a sketch of that setup follows below):

  • EBS volume 300GB SSD, RAID0

Ref: Amazon Instance Types
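For completeness, a minimal sketch of how such a RAID0 array can be assembled (the device names and the ext4 filesystem are assumptions, not necessarily what we used):

# Stripe 6 EBS volumes into a single md device (example device names)
sudo mdadm --create /dev/md0 --level=0 --raid-devices=6 \
    /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi /dev/xvdj /dev/xvdk
sudo mkfs.ext4 /dev/md0                  # filesystem choice is an assumption
sudo mkdir -p /var/lib/cassandra
sudo mount /dev/md0 /var/lib/cassandra
cat /proc/mdstat                         # verify the array is active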


Software Version:

Cassandra Version: 2.0.12


Thoughts:

After analysing our data we thought this was caused by Cassandra data compaction.
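The compaction activity can be inspected, and if necessary throttled, with nodetool; a sketch of the commands we use for that (the 16 MB/s below is just an example value, which happens to be Cassandra's default compaction throughput cap):

nodetool compactionstats              # pending/active compaction tasks (output below)
nodetool compactionhistory            # per-table history of completed compactions
nodetool setcompactionthroughput 16   # e.g. cap compaction at 16 MB/s (0 = unthrottled)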

There is another Stack Overflow question on the same subject: Cassandra compaction tasks stuck.

However, that one was solved by switching to a single SSD (Azure Premium Storage, still in preview at the time) with no RAID0 configured for Cassandra, and as the author says, there is no obvious reason why that would fix the underlying problem (why would removing the RAID0 part from the equation fix this?).

We are not yet keen to move to local storage, as the AWS pricing for it is a lot higher than what we pay now. Even so, if it really turns out to be the cause of our problem, we will try it.

Another reason why this sounds like a deeper problem is that we have data showing that these EBS volumes have been writing/reading a lot of data in the last 3 days.

Since we moved the instances, we have been seeing around 300-400KB of data written per second on each EBS volume; since we have a RAID0, that is 6 times this amount per instance, i.e. 1.8-2.4MB/s. That amounts to roughly 450-620GB of data written PER instance over the last 3 days, and we see basically the same values for READ operations too.
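As a sanity check on those numbers, the throughput-to-volume arithmetic is simple (3 days ≈ 259,200 seconds, using 1 GB = 1000 MB):

echo "low:  $((259200 * 18 / 10 / 1000)) GB"   # 1.8 MB/s over 3 days ≈ 466 GB
echo "high: $((259200 * 24 / 10 / 1000)) GB"   # 2.4 MB/s over 3 days ≈ 622 GB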

We are only running tests on them at the moment, so the only traffic we are getting comes from our CI server and from the gossip communication between the instances.


Debug notes

Output of nodetool status:

Datacenter: cassandra-eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx 539.5 MB   256     17.3%  12341234-1234-1234-1234-12341234123412340cd7  eu-west-1c
UN  xxx.xxx.xxx.xxx 539.8 MB   256     14.4%  30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
Datacenter: cassandra-ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx 585.13 MB  256     16.9%  a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  xxx.xxx.xxx.xxx 588.66 MB  256     17.8%  b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: cassandra-us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx 545.56 MB  256     15.2%  ab049390-f5a1-49a9-bb58-b8402b0d99af  us-east-1d
UN  xxx.xxx.xxx.xxx 545.53 MB  256     18.3%  39c698ea-2793-4aa0-a28d-c286969febc4  us-east-1e

Output of nodetool compactionstats:

pending tasks: 64
          compaction type        keyspace           table       completed           total      unit  progress
               Compaction         staging    stats_hourly       418858165      1295820033     bytes    32.32%
Active compaction remaining time :   0h00m52s

Running dstat on an unhealthy instance:

Compaction history in graph form (roughly 300 compactions per hour on average, starting on the 16th):

EBS volume usage:

Running df -h:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       33G   11G   21G  34% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            1.9G   12K  1.9G   1% /dev
tmpfs           377M  424K  377M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            1.9G  4.0K  1.9G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdb       3.9G  8.1M  3.7G   1% /mnt
/dev/md0        300G  2.5G  298G   1% /var/lib/cassandra

Running nodetool tpstats:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        3191689         0                 0
ReadStage                         0         0         574633         0                 0
RequestResponseStage              0         0        2698972         0                 0
ReadRepairStage                   0         0           2721         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
MiscStage                         0         0          62601         0                 0
HintedHandoff                     0         1            443         0                 0
FlushWriter                       0         0          88811         0                 0
MemoryMeter                       0         0           1472         0                 0
GossipStage                       0         0         979483         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0             25         0                 0
CompactionExecutor                1        39          99881         0                 0
ValidationExecutor                0         0          62599         0                 0
MigrationStage                    0         0             40         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropyStage                  0         0         149095         0                 0
PendingRangeCalculator            0         0             23         0                 0
MemtablePostFlusher               0         0         173847         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

Running iptraf, sorted by bytes:

Solution

We tried a few things from other answers and comments, but what finally solved this issue was terminating the 2 new instances.
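Since the instances were terminated rather than cleanly decommissioned, the rest of the cluster also has to be told they are gone; a minimal sketch of that cleanup (the Host ID is whatever nodetool status reports for the dead nodes, shown here as a placeholder):

nodetool status                              # terminated nodes eventually show up as DN
nodetool removenode <host-id-of-dead-node>   # placeholder Host ID from the status output
nodetool removenode status                   # check progress if the removal appears stuck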

When we then tried adding new instances to our cluster, it went smoothly, and the load is now back to normal.

My hunch is that nodetool rebuild or nodetool repair may have kicked off some unexpected processing on our two nodes. It may also be that these particular instances were faulty, but I have not found any evidence of that.

Here's the CPU usage on our eu-west instances after recycling the us-east instances:
