Counting columns, very slow CountQuery vs SliceQuery operations

Question

I've written a "census" program that iterates through all the rows in a Column Family and counts the columns in each row, recording the maximum column count and the corresponding row key. I've spent more time with the Hector client, but I've also written a Pelops client to test.

The basic flow is to use a RangeSlicesQuery to iterate through the rows and then, at each row, use a SliceQuery to iterate through the columns and collect the stats. It works similarly in Pelops, just with different APIs. The downside is having to do the buffering manually, picking buffer sizes for both rows and columns. My current data is 12 million rows, with the largest column count around 25K, so it takes a while. In my current configuration, I am getting more than 25K rows per second.
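
For readers who want to see the double buffering spelled out, here is a minimal sketch of that flow. It is not Hector or Pelops code: the column family is a plain in-memory dict, and `get_range`, `get_slice`, `iter_row_keys`, `count_columns` and `census` are hypothetical names invented for illustration. It also assumes keys come back in sorted order, which is a simplification of how a real cluster orders rows.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a column family: row key -> sorted, unique column names.
ColumnFamily = Dict[str, List[str]]


def get_range(cf: ColumnFamily, start_key: str, limit: int) -> List[str]:
    """Return up to `limit` row keys >= start_key (inclusive start). Assumed helper."""
    return sorted(k for k in cf if k >= start_key)[:limit]


def get_slice(cf: ColumnFamily, key: str, start_col: str, limit: int) -> List[str]:
    """Return up to `limit` column names >= start_col for one row. Assumed helper."""
    return [c for c in cf[key] if c >= start_col][:limit]


def iter_row_keys(cf: ColumnFamily, row_buffer: int) -> Iterator[str]:
    """Page through row keys, restarting each page at the last key seen."""
    if row_buffer < 2:
        raise ValueError("row_buffer must be at least 2 to make progress")
    start, first_page = "", True
    while True:
        page = get_range(cf, start, row_buffer)
        # The start key is inclusive, so later pages repeat one key; skip it.
        for key in (page if first_page else page[1:]):
            yield key
        if len(page) < row_buffer:
            return
        start, first_page = page[-1], False


def count_columns(cf: ColumnFamily, key: str, col_buffer: int) -> int:
    """Count the columns of one row by paging through it in slices."""
    if col_buffer < 2:
        raise ValueError("col_buffer must be at least 2 to make progress")
    total, start, first_page = 0, "", True
    while True:
        page = get_slice(cf, key, start, col_buffer)
        # Do not count the repeated start column on pages after the first.
        total += len(page) if first_page else len(page) - 1
        if len(page) < col_buffer:
            return total
        start, first_page = page[-1], False


def census(cf: ColumnFamily, row_buffer: int = 100,
           col_buffer: int = 1000) -> Tuple[int, Optional[str], int]:
    """Return (rows seen, key of the widest row, its column count)."""
    rows, max_key, max_count = 0, None, 0
    for key in iter_row_keys(cf, row_buffer):
        n = count_columns(cf, key, col_buffer)
        rows += 1
        if max_key is None or n > max_count:
            max_key, max_count = key, n
    return rows, max_key, max_count
```

Both buffer sizes have to be at least 2 here, because each page after the first spends one slot on the repeated start key or column.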

Looking for ways to improve this, I discovered Hector's CountQuery (which I assume uses the Thrift client's get_count()). Thinking it would be faster to iterate over just the keys (using RangeSlicesQuery.setReturnKeysOnly()) and then reuse a CountQuery on each row key, I revised the code.
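
A rough sketch of the revised flow, again against an in-memory stand-in rather than the real Hector API. `FakeClient`, `get_range_keys`, `get_count` and `census_by_count` are hypothetical names (the `get_count` name only echoes the Thrift call mentioned above). The client records how many requests it serves, which makes one property of this design visible: it issues one count request per row on top of the key paging. Whether that is what causes the slowdown is not established here.

```python
from typing import Dict, List, Optional, Tuple


class FakeClient:
    """Hypothetical in-memory client that counts the requests it serves."""

    def __init__(self, cf: Dict[str, List[str]]) -> None:
        self.cf = cf
        self.calls = 0

    def get_range_keys(self, start_key: str, limit: int) -> List[str]:
        """Keys-only range read: up to `limit` keys >= start_key (inclusive)."""
        self.calls += 1
        return sorted(k for k in self.cf if k >= start_key)[:limit]

    def get_count(self, key: str) -> int:
        """Per-row column count; each call is a separate request."""
        self.calls += 1
        return len(self.cf[key])


def census_by_count(client: FakeClient,
                    row_buffer: int = 100) -> Tuple[int, Optional[str], int]:
    """Iterate keys only, then ask for a count on every key."""
    if row_buffer < 2:
        raise ValueError("row_buffer must be at least 2 to make progress")
    rows, max_key, max_count = 0, None, 0
    start, first_page = "", True
    while True:
        page = client.get_range_keys(start, row_buffer)
        # Skip the repeated start key on every page after the first.
        for key in (page if first_page else page[1:]):
            n = client.get_count(key)
            rows += 1
            if max_key is None or n > max_count:
                max_key, max_count = key, n
        if len(page) < row_buffer:
            return rows, max_key, max_count
        start, first_page = page[-1], False
```

Comparing `client.calls` with the number of rows after a run gives a quick feel for how request volume scales with this approach.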

Not only was it slower, it was about 30x slower! (It processed only 900 rows per second.)

Is there a better way to count columns?

Recommended answer

I'm not sure what's going on with Hector. I'd expect it to be roughly 2x slower, not 30x slower.

More generally, keeping a denormalized count using a counter column is probably better than a full CF scan: http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
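
To illustrate the idea of a denormalized count, here is a small in-memory sketch. It is not Cassandra code: `CountedColumnFamily` and its methods are hypothetical, and a plain dict of integers stands in for the per-row counter column. The sketch checks whether a column already exists before incrementing, which is free in memory but is an assumption here; in a real cluster the application would need some other way to know whether a write adds a new column.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class CountedColumnFamily:
    """Hypothetical store that keeps a per-row column count next to the data."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = defaultdict(dict)
        # Stand-in for a counter column: row key -> number of columns.
        self.column_counts: Dict[str, int] = defaultdict(int)

    def insert(self, key: str, column: str, value: str) -> None:
        """Write a column; bump the count only when the column is new."""
        if column not in self.rows[key]:
            self.column_counts[key] += 1
        self.rows[key][column] = value

    def delete(self, key: str, column: str) -> None:
        """Remove a column and decrement the count if it existed."""
        if column in self.rows.get(key, {}):
            del self.rows[key][column]
            self.column_counts[key] -= 1

    def widest_row(self) -> Optional[Tuple[str, int]]:
        """Find the row with the most columns by reading only the counts."""
        if not self.column_counts:
            return None
        key = max(self.column_counts, key=self.column_counts.get)
        return key, self.column_counts[key]
```

Finding the widest row still touches every row, but it reads one small number per row instead of paging through every column.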
