Cassandra timeout cqlsh query large(ish) amount of data


Problem description



I'm doing a student project involving building and querying a Cassandra data cluster.

When my cluster load was light (around 30GB), my queries ran without a problem, but now that it's quite a bit bigger (1/2 TB), my queries are timing out.

I thought that this problem might arise, so before I began generating and loading test data I had changed this value in my cassandra.yaml file:

request_timeout_in_ms (Default: 10000): The default timeout for other, miscellaneous operations.

However, when I changed that value to something like 1000000, Cassandra seemingly hung on startup -- but that could've just been the large timeout at work.
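
In cassandra.yaml, that change amounts to a single line, roughly as follows (the 1000000 figure is the value mentioned above; 10000 ms is the shipped default):

request_timeout_in_ms: 1000000    # default is 10000 (10 seconds)

Note that cqlsh also applies its own client-side timeout, configured separately in cqlshrc, so raising the server-side value alone may not remove timeouts seen from cqlsh.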

My goal for data generation is 2TB. How do I query that large of space without running into timeouts?

Queries:

SELECT  huntpilotdn 
FROM    project.t1 
WHERE   (currentroutingreason, orignodeid, origspan,  
        origvideocap_bandwidth, datetimeorigination)
        > (1,1,1,1,1)
AND      (currentroutingreason, orignodeid, origspan,    
         origvideocap_bandwidth, datetimeorigination)
         < (1000,1000,1000,1000,1000)
LIMIT 10000
ALLOW FILTERING;

SELECT  destcause_location, destipaddr
FROM    project.t2
WHERE   datetimeorigination = 110
AND     num >= 11612484378506
AND     num <= 45880092667983
LIMIT 10000;


SELECT  origdevicename, duration
FROM    project.t3
WHERE   destdevicename IN ('a','f', 'g')
LIMIT 10000
ALLOW FILTERING;

I have a demo keyspace with the same schemas but a far smaller data size (~10GB), and these queries run just fine in that keyspace.

All of the tables being queried have millions of rows, with around 30 columns in each row.

Solution

I'm going to guess that you are also using secondary indexes. You are finding out firsthand why secondary index queries and ALLOW FILTERING queries are not recommended...because those types of design patterns do not scale for large datasets. Rebuild your model with query tables that support primary key lookups, as that is how Cassandra is designed to work.
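
For illustration only (the column names come from the queries above, but the types and the extra clustering columns are assumptions), a query table for the third query might look roughly like this, with the filtered column promoted to the partition key so the read becomes a partition-key lookup and no longer needs ALLOW FILTERING:

-- Hypothetical query table; adjust types to the real schema.
CREATE TABLE project.t3_by_destdevicename (
    destdevicename      text,
    datetimeorigination bigint,
    origdevicename      text,
    duration            int,
    PRIMARY KEY (destdevicename, datetimeorigination, origdevicename)
);

SELECT origdevicename, duration
FROM   project.t3_by_destdevicename
WHERE  destdevicename IN ('a','f','g')
LIMIT  10000;

Each such table is populated (denormalized) at write time to serve its one query.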

Edit

"The variables that are constrained are cluster keys."

Right...which means they are not partition keys. Without constraining your partition key(s) you are basically scanning your entire table, as clustering keys are only valid (they cluster data) within their partition key.
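
To make that concrete, here is a rough sketch (column types are assumed) of a layout in which the second query does constrain a partition key, so the range on num is served from a single partition rather than a cluster-wide scan:

-- Hypothetical layout: datetimeorigination as the partition key, num as a clustering key.
CREATE TABLE project.t2_by_origination (
    datetimeorigination bigint,
    num                 bigint,
    destcause_location  text,
    destipaddr          text,
    PRIMARY KEY (datetimeorigination, num)
);

-- Partition key fixed, clustering key ranged: valid without ALLOW FILTERING.
SELECT destcause_location, destipaddr
FROM   project.t2_by_origination
WHERE  datetimeorigination = 110
  AND  num >= 11612484378506
  AND  num <= 45880092667983
LIMIT  10000;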

Edit 20190731

So while I may have the "accepted" answer, I can see that there are three additional answers here. They all focus on changing the query timeout, and two of them outscore my answer (one by quite a bit).

As this question continues to rack up page views, I feel compelled to address the aspect of increasing the timeout. Now, I'm not about to downvote anyone's answers, as that would look like "sour grapes" from a vote perspective. But I can articulate why I don't feel that solves anything.

First, the fact that the query times out at all is a symptom; it's not the main problem. Therefore, increasing the query timeout is simply a band-aid solution, obscuring the main problem.

The main problem, of course, is that the OP is trying to force the cluster to support a query that does not match the underlying data model. As long as that problem is ignored and worked around (instead of being dealt with directly), it will continue to manifest itself.

Second, look at what the OP is actually trying to do:

My goal for data generation is 2TB. How do I query that large of space without running into timeouts?

Those query timeout limits are there to protect your cluster. If you were to run a full-table scan (which, to Cassandra, means a full-cluster scan) through 2TB of data, the timeout threshold would have to be quite large. In fact, if you did manage to find the right number to allow that, your coordinator node would tip over LONG before most of the data was assembled in the result set.

In summary, increasing query timeouts:

  1. Gives the appearance of "helping" by forcing Cassandra to work against how it was designed.

  2. Can potentially crash a node, putting the stability of the underlying cluster at risk.

Therefore, increasing the query timeouts is a terrible, TERRIBLE IDEA.
