Is there a way to effectively count rows of a very huge partition in Cassandra?


Problem Description

I have a very huge Cassandra table containing over 1 billion records. My primary key is formed like this: (partition_id, cluster_id1, cluster_id2). Now, for several particular partition_id values, I have too many records to run a row count on those partition keys without a timeout exception being raised.

What I ran in cqlsh is:

SELECT COUNT(*) FROM relation WHERE partition_id = 'some_huge_partition';

And I got this exception:

ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

I tried to set --connect-timeout and --request-timeout; no luck. I counted the same data in Elasticsearch, and the row count is approximately 30 million (for the same partition).

My Cassandra is 3.11.2 and cqlsh is 5.0.1. The Cassandra cluster contains 3 nodes, and each has a 1 TB HDD (fairly old servers, more than 8 years old).

So in short, my questions are:

  1. How can I count it? Is it even possible to count a huge partition in Cassandra?
  2. Can I use the COPY TO command with the partition key as its filter, so I can count the rows in the exported CSV file?
  3. Is there a way to monitor the insert process before any partition gets too huge?

Many thanks in advance.

Answer

Yes, working with large partitions is difficult with Cassandra. There really isn't a good way to monitor particular partition sizes, although Cassandra will warn about writing large partitions in your system.log. Unbounded partition growth is something you need to address when designing your table, and it involves adding an additional (usually time-based) partition key component derived from an understanding of your business use case.
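As a sketch of that approach, assuming the question's relation table and text-typed columns (the real schema isn't given), a time bucket added to the partition key caps how large any single partition can grow:

-- Hypothetical revision of the question's table; all column types,
-- and the month_bucket column itself, are assumptions.
CREATE TABLE relation_bucketed (
    partition_id text,
    month_bucket text,   -- e.g. '2018-09', derived from the write timestamp
    cluster_id1  text,
    cluster_id2  text,
    PRIMARY KEY ((partition_id, month_bucket), cluster_id1, cluster_id2)
);

Writes then supply the bucket along with partition_id, and reads for one logical partition iterate over the buckets of interest, so no single partition accumulates tens of millions of rows.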

The answer here is that you may be able to export the data in the partition using the COPY command. To keep it from timing out, you'll want to use the PAGESIZE and PAGETIMEOUT options, kind of like this:

COPY products TO '/home/aploetz/products.txt'
  WITH DELIMITER='|' AND HEADER=true
  AND PAGETIMEOUT=40 AND PAGESIZE=20;

That will export the products table to a pipe-delimited file, with a header, at a page size of 20 rows at a time and with a 40-second timeout for each page fetch. Counting the lines in the exported file (minus the header row) then gives you the row count.

If you still get timeouts, try decreasing PAGESIZE and/or increasing PAGETIMEOUT.
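Alternatively, the count itself can sometimes be done in slices: with the partition key fixed by equality, CQL allows range restrictions on the first clustering column, so each query scans only part of the huge partition. A minimal sketch, assuming cluster_id1 is a text column and the slice boundaries are picked by hand:

-- Sum the results of the slices; the boundaries here are illustrative.
SELECT COUNT(*) FROM relation
  WHERE partition_id = 'some_huge_partition'
  AND cluster_id1 < 'n';

SELECT COUNT(*) FROM relation
  WHERE partition_id = 'some_huge_partition'
  AND cluster_id1 >= 'n';

Each slice still has to finish within the read timeout, so a partition with roughly 30 million rows may need many narrower slices.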
