Cassandra 3.10 debug.log contains frequent "FailureDetector.java:457 - Ignoring interval time of..."


Problem description

The debug.log files for one of our Cassandra 3.10 clusters contain frequent messages similar to "FailureDetector.java:457 - Ignoring interval time of…"

The messages appear even when the cluster is idle. I see them at a rate of about 1 per second on each node of this 6-node cluster (3 nodes in each of two data centers).

Can someone tell me what causes these messages and whether they are something to be concerned about?

We have a couple of other small clusters supporting the same application (in different environments), and there I see this message much less often (days apart).

Recommended answer

The FailureDetector is responsible for deciding whether a node is considered UP or DOWN.


The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, third-hand, and so on). Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, and historical conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster.
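To make the accrual idea concrete, here is a minimal sketch of a per-peer detector built on a sliding window of inter-arrival times. It assumes an exponential model of inter-arrival times, which is a simplification chosen for illustration; the class name, the window size, and the timestamps are all hypothetical, and this is not Cassandra's actual code.

```python
import math
from collections import deque


class PhiAccrualSketch:
    """Toy accrual detector for a single peer (illustrative only)."""

    def __init__(self, window_size=1000):
        # Sliding window of recent inter-arrival times, in seconds.
        # The window size is an arbitrary choice for this sketch.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record that a gossip message arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: grows the longer the peer stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # Assumed exponential model: P(silence > elapsed) = exp(-elapsed / mean),
        # and phi is -log10 of that probability.
        return elapsed / mean / math.log(10)


detector = PhiAccrualSketch()
for t in (0.0, 1.0, 2.0, 3.0):  # one message per second
    detector.heartbeat(t)

print(detector.phi(4.0))   # one second of silence: low suspicion
print(detector.phi(30.0))  # 27 seconds of silence: high suspicion
```

A node would be marked DOWN once the value crosses a configured threshold. Because the mean is recomputed from the window, a peer that is normally slow needs a longer silence to reach the same suspicion level than a peer that is normally fast.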

The log message comes from the FailureDetector source code (FailureDetector.java, line 457, as the message itself indicates). These messages are logged at DEBUG level because they may help in tracking down the actual issue causing the latency, but they do not indicate a problem on their own.

In other words: your node measures the interval between gossip messages from each of the other nodes, e.g. X nanoseconds for IP address 1, Y nanoseconds for IP address 2, and so on. If either X or Y is above the expected 2-second threshold, as defined by MAX_INTERVAL_IN_NANO, that interval is ignored and this message is logged.
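The check described above can be sketched as follows. This is a simplified illustration, not the Cassandra implementation: the class name, the `add` method, and the exact message wording are assumptions modeled on the log line quoted in the question, and the 2-second limit is the threshold mentioned above.

```python
MAX_INTERVAL_NS = 2_000_000_000  # 2 seconds, the threshold described above


class ArrivalWindowSketch:
    """Keeps inter-arrival intervals for one peer, dropping oversized ones."""

    def __init__(self, max_interval_ns=MAX_INTERVAL_NS):
        self.max_interval_ns = max_interval_ns
        self.last_arrival_ns = None
        self.intervals_ns = []

    def add(self, now_ns, endpoint):
        """Record a message arrival and return a log-style line, if any."""
        message = None
        if self.last_arrival_ns is not None:
            interval = now_ns - self.last_arrival_ns
            if interval <= self.max_interval_ns:
                # Normal case: the interval feeds the sliding window.
                self.intervals_ns.append(interval)
                message = f"Reporting interval time of {interval} for {endpoint}"
            else:
                # Too long: left out of the window so it does not skew the mean.
                message = f"Ignoring interval time of {interval} for {endpoint}"
        self.last_arrival_ns = now_ns
        return message


window = ArrivalWindowSketch()
window.add(0, "/10.0.0.1")                     # first arrival, nothing to compare
print(window.add(1_000_000_000, "/10.0.0.1"))  # 1 s gap: recorded
print(window.add(3_500_000_000, "/10.0.0.1"))  # 2.5 s gap: ignored
```

The point of dropping long intervals is that a single pause should not stretch the detector's idea of what a normal gap looks like. It also explains why the message alone is not an error: it only says that one gap was longer than the limit.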

Problems that can cause this log message:


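  • Heavy load on the nodes: for example, too many large partitions

  • High pressure: for example, too many queries within a period of time

  • A bad network connection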
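One way to narrow down which of these causes applies is to count the messages per peer: if a single endpoint dominates, the network path to or the load on that node is the likely suspect, whereas an even spread points at the local node. The sketch below assumes the log line has the form "Ignoring interval time of <nanoseconds> for <endpoint>", inferred from the message quoted in the question; the sample lines, timestamps, and addresses are made up.

```python
import re
from collections import Counter

# Assumed message format; adjust the pattern if your log lines differ.
PATTERN = re.compile(r"Ignoring interval time of (\d+) for (\S+)")


def summarize(lines):
    """Return (count per endpoint, worst interval in ns per endpoint)."""
    counts = Counter()
    worst_ns = {}
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue
        interval_ns, endpoint = int(match.group(1)), match.group(2)
        counts[endpoint] += 1
        worst_ns[endpoint] = max(worst_ns.get(endpoint, 0), interval_ns)
    return counts, worst_ns


# Hypothetical sample lines, not taken from a real cluster.
sample = [
    "DEBUG [GossipStage:1] 2017-03-01 12:00:01,100 FailureDetector.java:457 - Ignoring interval time of 2000329305 for /10.0.0.2",
    "DEBUG [GossipStage:1] 2017-03-01 12:00:02,200 FailureDetector.java:457 - Ignoring interval time of 2400100200 for /10.0.0.2",
    "DEBUG [GossipStage:1] 2017-03-01 12:00:03,300 FailureDetector.java:457 - Ignoring interval time of 2100000000 for /10.0.0.3",
    "DEBUG [GossipStage:1] 2017-03-01 12:00:04,400 SomethingElse.java:10 - unrelated line",
]

counts, worst_ns = summarize(sample)
for endpoint, count in counts.most_common():
    print(endpoint, count, worst_ns[endpoint] / 1e9, "s")
```

To run it against a real file, pass an open file handle instead of the sample list, for example `summarize(open("debug.log"))`, where the path is whatever your installation uses.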

The extra FailureDetector logging was added by this ticket: "Expose phi values from failure detector via JMX and tweak debug and trace logging" (CASSANDRA-9526).

I also found this open issue, which might be related to your problem: "The failure detector becomes more sensitive when the network is flakey" (CASSANDRA-9536).

I also found this article about gossiping and failure detection very useful.
