Cassandra performance for long rows

Question

I'm looking at implementing a CF in Cassandra that has very long rows (hundreds of thousands to millions of columns per row).

Using entirely dummy data, I've inserted 2 million columns into a single row (evenly spaced). If I do a slice operation to get 20 columns, I notice a massive performance degradation as the slice moves further down the row.
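
For reference, a minimal sketch of the kind of slice read I'm timing, using the Hector client (the query.setRange call quoted later is Hector's SliceQuery API); the cluster, keyspace, column family, and row key names below are just placeholders, not my real schema:

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

public class LongRowSliceTest {
    public static void main(String[] args) {
        // Placeholder names -- substitute your own cluster/keyspace/CF.
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("Keyspace1", cluster);

        // Fetch a slice of 20 columns starting at column name 1,800,000
        // (columns are named by evenly spaced Long values in this test).
        SliceQuery<String, Long, String> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
        query.setColumnFamily("LongRows");
        query.setKey("row1");
        query.setRange(1800000L, null, false, 20); // start, finish, reversed, count

        long t0 = System.nanoTime();
        QueryResult<ColumnSlice<Long, String>> result = query.execute();
        long elapsedMs = (System.nanoTime() - t0) / 1000000L;
        System.out.println(result.get().getColumns().size() + " columns in " + elapsedMs + "ms");
    }
}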

For most of the columns, I seem to be able to serve up slice results in 10-40ms, but as you get towards the end of the row, performance hits a wall, with response times climbing from 43ms at the 1,800,000 mark to 214ms at 1,900,000 and 435ms at 1,999,900! (All slices are of equal width.)

I'm at a loss to explain why there is such a massive degradation in performance towards the end of the row. Can someone please provide some guidance as to what Cassandra is doing internally to cause such a delay? Row caching is turned off, and pretty much everything else is a default Cassandra 1.0 installation.

Cassandra is supposed to support up to 2 billion columns per row, but at this rate of performance degradation it can't practically be used for very long rows.

Many thanks.

Caveat: I'm hitting this with 10 requests in parallel at a time, which is why they're a bit slower than I'd otherwise expect, but it's a fair test across all requests, and even running them all serially the same strange degradation shows up between the 1,800,000th and 1,900,000th record.

I've also noticed EXTREMELY bad performance when doing reverse slices for just a single item, even with only 200,000 columns per row:

query.setRange(end, start, false, 1);

Answer

psanford's comment led me to the answer. It turns out that Cassandra < 1.1.0 (1.1.0 is currently in beta) is slow at slicing long rows that are still in memtables (i.e. not yet flushed to disk), but performs much better on the same data once it has been flushed to SSTables on disk.

See http://mail-archives.apache.org/mod_mbox/cassandra-user/201201.mbox/%3CCAA_K6YvZ=vd=Bjk6BaEg41_r1gfjFaa63uNSXQKxgeB-oq2e5A@mail.gmail.com%3E and https://issues.apache.org/jira/browse/CASSANDRA-3545

With my example, the first 1.8 million columns had been flushed to disk, so slices over that range were fast, but the last ~200,000 columns hadn't been flushed yet and were still in memtables. Since memtable slicing is slow on long rows, that's why I saw bad performance towards the end of the row (my data was inserted in column order).
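
As an aside (my own suggestion, not something from the thread): you can check how much of a column family is still sitting in memtables with nodetool's cfstats command, which reports a "Memtable Columns Count" for each column family:

nodetool -h 127.0.0.1 cfstats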

This can be worked around by manually triggering a flush on the Cassandra nodes. A patch has been applied to 1.1.0 to fix the underlying problem, and I can confirm that it resolves the issue for me.
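
For anyone who wants to force the flush by hand, the command (my phrasing; the exact invocation isn't in the original post, but this is standard nodetool usage) looks like this, with your own host, keyspace, and column family names substituted:

nodetool -h 127.0.0.1 flush MyKeyspace MyColumnFamily

Leaving off the column family name flushes every column family in the keyspace.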

I hope this helps anyone else with the same problem.
