Cassandra compaction tasks stuck


Problem description



I'm running DataStax Enterprise in a cluster consisting of 3 nodes. They are all running on the same hardware: 2-core Intel Xeon at 2.2 GHz, 7 GB RAM, 4 TB RAID-0.

This should be enough for running a cluster with a light load, storing less than 1 GB of data.

Most of the time everything is just fine, but it appears that the running tasks related to the Repair Service in OpsCenter sometimes get stuck; this causes instability in that node and an increase in load.

However, if the node is restarted, the stuck tasks no longer show up and the load returns to normal.
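For completeness, a minimal sketch of how such a restart can be done cleanly (assuming a package installation of DSE; a tarball install starts the node differently):

    # Flush memtables and stop accepting new requests on this node first
    nodetool drain

    # Restart the DSE service (service name for a package install)
    sudo service dse restart

    # Verify the node rejoined the ring and that no compactions are pending
    nodetool status
    nodetool compactionstats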

Because we don't have much data in our cluster, we're using the min_repair_time parameter defined in opscenterd.conf to delay the repair service so that it doesn't complete too often.
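A sketch of what that setting looks like (the section name and value are illustrative and depend on the OpsCenter version; this is not our exact configuration):

    # opscenterd.conf (or the per-cluster configuration file)
    [repair_service]
    # Minimum number of seconds a full repair cycle is allowed to take.
    # If the cluster repairs faster than this, the service waits before
    # starting the next cycle, so repairs don't complete too often.
    min_repair_time = 604800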

It seems a little weird that tasks which are marked as "Complete" and show a progress of 100% don't go away. And yes, we've waited hours for them to disappear, but they won't; the only way we've found to get rid of them is to restart the nodes.

Edit:

Here's the output from nodetool compactionstats

Edit 2:

I'm running DataStax Enterprise v4.6.0 with Cassandra v2.0.11.83.

Edit 3:

This is the output from dstat on a node that is behaving normally:

This is the output from dstat on a node with a stuck compaction:

Edit 4:

Output from iostat on a node with a stuck compaction; note the high iowait:
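For reference, per-device statistics like these can be collected with something along the following lines (the 5-second interval is arbitrary):

    # Extended device statistics every 5 seconds; a high %iowait in the CPU
    # line together with high await/%util on the data disk points to the
    # storage as the bottleneck
    iostat -x 5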

Solution

So, this issue has been under investigation for a long time now, and we've found a solution. However, we aren't sure what the underlying problem causing the issue was; we have a clue, but nothing can be confirmed.

Basically, what we did was set up a RAID-0 (also known as striping) consisting of four disks, each 1 TB in size. With the stripe we should have seen somewhere around 4x the IOPS of a single disk, but we didn't, so something was clearly wrong with the RAID setup.
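A quick way to sanity-check that expectation is to benchmark the array against a single member disk, for example with fio (the paths and parameters below are placeholders, not our actual test):

    # Random 4k writes against a test file on the RAID-0 mount point
    sudo fio --name=raid0-test --filename=/data/fio-test --size=4G \
        --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
        --runtime=60 --time_based

    # Repeat against a file on a single, non-striped disk and compare the
    # reported IOPS; a healthy four-disk stripe should come close to 4x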

We used multiple utilities to confirm that the CPU was waiting for IO most of the time whenever we considered the node "stuck". Clearly something with the IO, most probably our RAID setup, was causing this. We tried a few variations of the mdadm settings etc., but didn't manage to solve the problem while keeping the RAID setup.
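For illustration, these are the kinds of mdadm and block-device knobs such an investigation touches (device names are placeholders, and these are not necessarily the exact changes we tried):

    # Inspect the array layout and chunk size
    sudo mdadm --detail /dev/md0

    # Recreate the stripe with a different chunk size (destructive!)
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=256 \
        /dev/sdc /dev/sdd /dev/sde /dev/sdf

    # Check and raise the read-ahead of the md device (value is in 512-byte sectors)
    sudo blockdev --getra /dev/md0
    sudo blockdev --setra 4096 /dev/md0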

We started investigating Azure Premium Storage (which is still in preview). It allows attaching disks to VMs whose underlying physical storage is actually SSDs. So we said, well, SSDs => more IOPS, so let's give it a try. We did not set up any RAID with the SSDs; we are using only a single SSD disk per VM.

We've been running the cluster for almost 3 days now and have stress-tested it a lot, but we haven't been able to reproduce the issue.
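If anyone wants to generate a similar load, the cassandra-stress tool bundled with Cassandra can be used; a sketch using the legacy 2.0-era syntax (node addresses and counts are placeholders):

    # Insert one million rows from 50 client threads against the cluster
    cassandra-stress -d 10.0.0.4,10.0.0.5,10.0.0.6 -n 1000000 -t 50

    # Follow up with a read pass over the same keys
    cassandra-stress -d 10.0.0.4,10.0.0.5,10.0.0.6 -n 1000000 -t 50 -o READ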

I guess we never got down to the real cause, but the conclusion is that one or both of the following must have been the underlying cause of our problems:

  • Disks that were too slow (write demand exceeded the available IOPS)
  • A RAID array that was set up incorrectly, causing the disks to behave abnormally

These two problems go hand in hand, and most likely we had simply set up the disks the wrong way. However, SSDs = more power to the people, so we will definitely continue using SSDs.

If anyone experiences the same problems we had on Azure with RAID-0 on large disks, don't hesitate to add your findings here.
