Cassandra Timeouts with No CPU Usage

Problem Description

I am getting Cassandra timeouts using the Phantom-DSL with the Datastax Cassandra driver. However, Cassandra does not seem to be overloaded. Below is the exception I get:

com.datastax.driver.core.exceptions.OperationTimedOutException: [node-0.cassandra.dev/10.0.1.137:9042] Timed out waiting for server response
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:766)
    at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1267)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:588)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:662)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:385)
    at java.lang.Thread.run(Thread.java:745)

And here are the statistics I get from the Cassandra Datadog connector over this time period:

You can see our read rate (per second) on the top-center graph. Our CPU and memory usage are very low.

Here is how we are configuring the Datastax driver:

val points = ContactPoints(config.cassandraHosts)
  .withClusterBuilder(_.withSocketOptions(
    new SocketOptions()
      .setReadTimeoutMillis(config.cassandraNodeTimeout)
  ))
  .withClusterBuilder(_.withPoolingOptions(
    new PoolingOptions()
      .setConnectionsPerHost(
        HostDistance.LOCAL,
        2,
        2
      )
      .setConnectionsPerHost(
        HostDistance.REMOTE,
        2,
        2
      )
      .setMaxRequestsPerConnection(
        HostDistance.LOCAL,
        2048
      )
      .setMaxRequestsPerConnection(
        HostDistance.REMOTE,
        2048
      )
      .setPoolTimeoutMillis(10000)
      .setNewConnectionThreshold(
        HostDistance.LOCAL,
        1500
      )
      .setNewConnectionThreshold(
        HostDistance.REMOTE,
        1500
      )
  ))

Our nodetool cfstats looks like this:

$ nodetool cfstats alexandria_dev.match_sums
Keyspace : alexandria_dev
        Read Count: 101892
        Read Latency: 0.007479115141522397 ms.
        Write Count: 18721
        Write Latency: 0.012341060840767052 ms.
        Pending Flushes: 0
                Table: match_sums
                SSTable count: 0
                Space used (live): 0
                Space used (total): 0
                Space used by snapshots (total): 0
                Off heap memory used (total): 0
                SSTable Compression Ratio: 0.0
                Number of keys (estimate): 15328
                Memtable cell count: 15332
                Memtable data size: 21477107
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 17959
                Local read latency: 0.015 ms
                Local write count: 15332
                Local write latency: 0.013 ms
                Pending flushes: 0
                Percent repaired: 100.0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 0
                Bloom filter off heap memory used: 0
                Index summary off heap memory used: 0
                Compression metadata off heap memory used: 0
                Compacted partition minimum bytes: 0
                Compacted partition maximum bytes: 0
                Compacted partition mean bytes: 0
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 1.0
                Maximum tombstones per slice (last five minutes): 1
                Dropped Mutations: 0

When we ran cassandra-stress, we didn't experience any issues: we were getting a steady 50k reads per second, as expected.
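
A stress run of this kind can be launched along these lines (an illustrative invocation, not our exact command; options will vary):

$ cassandra-stress write n=1000000 -rate threads=100 -node 10.0.1.137   # populate the default keyspace1.standard1 table
$ cassandra-stress read n=1000000 -rate threads=100 -node 10.0.1.137    # then benchmark reads against it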

Cassandra logs this error whenever I run my queries:

INFO  [Native-Transport-Requests-2] 2017-03-10 23:59:38,003 Message.java:611 - Unexpected exception during request; channel = [id: 0x65d7a0cd, L:/10.0.1.98:9042 ! R:/10.0.1.126:35536]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
        at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]

Why are we getting timeouts?

EDIT: I had the wrong dashboard uploaded. Please see the new image.

Solution

I suggest tracing the problematic query to see what Cassandra is doing.

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tracing_r.html

Open a cqlsh shell, type TRACING ON and execute your query. If everything seems fine, there is a chance that this problem only happens occasionally, in which case I'd suggest tracing queries using nodetool settraceprobability for some time, until you manage to catch the problem.
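
For example, using the table from the question (the LIMIT 1 query is just a placeholder; run one of your real queries):

cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> SELECT * FROM alexandria_dev.match_sums LIMIT 1;

cqlsh then prints a tracing table for the request, listing each internal step with its source node and elapsed time in microseconds, so you can see where the time goes.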

You enable it on each node separately using nodetool settraceprobability <param> where param is the probability (between 0 and 1) that the query will get traced. Careful: this WILL cause increased load, so start with a very low number and go up.
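
For example, to trace roughly 0.1% of requests (the 0.001 value is just a starting point):

$ nodetool settraceprobability 0.001   # trace ~0.1% of requests on this node
$ nodetool gettraceprobability         # verify the current setting

The traced sessions land in the system_traces keyspace (the sessions and events tables), which you can query afterwards. Set the probability back to 0 when you are done.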

If this problem is occasional, there is a chance that it is caused by long garbage collections, in which case you need to analyse the GC logs. Check how long your GC pauses are.
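
Cassandra already logs long collections through its GCInspector, so a quick first check looks something like this (log location may differ on your install):

$ nodetool gcstats                                               # per-node GC summary, including the max pause since the last call
$ grep -i gcinspector /var/log/cassandra/system.log | tail -20   # long pauses show up as "... GC in NNNms" lines

Pauses in the hundreds of milliseconds or more would line up with the client-side timeouts you are seeing.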

Edit: just to be clear, if this problem is caused by GC pauses you will NOT see it with tracing. So first check your GC, and if it's not the problem then move on to tracing.
