What's the nature of Cassandra "write timeout"?


Question

I am running a write-heavy program (10 threads, peaking at 25K writes/sec) on a 24-node Cassandra 3.5 cluster on AWS EC2 (each host is a c4.2xlarge: 8 vcores and 15 GB of RAM).

Every once in a while my Java client, which uses DataStax driver 3.0.2, gets a write timeout:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency TWO (2 replica were required but only 1 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:73)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:26)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)

The error happens infrequently and very unpredictably. So far I have not been able to link the failures to anything specific (e.g. program running time, data size on disk, time of day, or indicators of system load such as CPU, memory and network metrics). Nonetheless, it is really disrupting our operations.

I am trying to find the root cause of the issue. Looking online for options, I am a bit overwhelmed by all the leads out there, such as:


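  • Change write_request_timeout_in_ms in cassandra.yaml (already raised to 5 seconds)
  • Use a proper RetryPolicy to keep the session going (already using DowngradingConsistencyRetryPolicy with a session-level consistency level of ONE)
  • Change cache size, heap size, etc. I never tried these b/c there are good reasons to discount them as the root cause.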

One thing that really confuses me in my research is that I am getting this error from a fully replicated cluster with very few ClientRequest.timeout.write events:


  • I have a fully replicated 24-node cluster that spans 5 AWS regions. Each region has at least 2 copies of the data
  • My program runs at consistency level ONE, set at the Session level (Cluster builder with QueryOptions)
  • When the error happened, our Graphite chart registered no more than three (3) host hiccups, i.e. hosts reporting Cassandra.ClientRequest.Write.Timeouts.Count values
  • I already set the write timeout (write_request_timeout_in_ms) to 5 seconds. The network is pretty fast (verified with iperf3) and stable

On paper, the situation should be well within Cassandra's failsafe range. So why does my program still fail? Are the numbers not what they appear to be?

Recommended answer

It's not necessarily a bad thing to see timeouts or errors, especially if you're writing at a higher consistency level: the writes may still get through.

I see you mention CL=ONE. You could still get timeouts here even though the write (mutation) got through. I found this blog really useful: https://www.datastax.com/dev/blog/cassandra-error-handling-done-right. Check your server-side (node) logs at the time of the error for ERROR / WARN messages or GC pauses (as one of the comments above mentions). These kinds of events can make a node unresponsive and therefore cause a timeout or another type of error.

If your updates are idempotent (which is ideal), you can build in some retry mechanism.
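As a rough illustration of such a retry mechanism, here is a minimal sketch in plain Java with no driver dependency. The names IdempotentRetry, withRetry and TransientWriteException are hypothetical and only stand in for your own code and for the driver's WriteTimeoutException; the exponential backoff is an assumption of this sketch, not something the answer prescribes. It is only safe when repeating the write cannot change the result.

```java
import java.util.function.Supplier;

final class IdempotentRetry {

    /** Hypothetical stand-in for the driver's WriteTimeoutException. */
    public static class TransientWriteException extends RuntimeException {
        public TransientWriteException(String message) {
            super(message);
        }
    }

    /**
     * Runs the write up to maxAttempts times, doubling the pause between
     * attempts. Only use this for idempotent writes: a timed-out write may
     * already have been applied, so it can end up being applied twice.
     */
    public static <T> T withRetry(Supplier<T> write, int maxAttempts, long initialBackoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        TransientWriteException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return write.get();
            } catch (TransientWriteException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                    backoffMs *= 2;
                }
            }
        }
        // Every attempt timed out: surface the last failure to the caller.
        throw last;
    }
}
```

With the real driver you would presumably catch WriteTimeoutException instead of the stand-in and wrap the call, along the lines of withRetry(() -> session.execute(statement), 3, 100), where session and statement are your own objects.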
