Cassandra timeout cqlsh query large(ish) amount of data


Problem Description

I'm doing a student project involving building and querying a Cassandra data cluster.

When my cluster load was light (around 30 GB) my queries ran without a problem, but now that it's quite a bit bigger (1/2 TB) my queries are timing out.

I thought that this problem might arise, so before I began generating and loading test data I had changed this value in my cassandra.yaml file:

request_timeout_in_ms (Default: 10000) The default timeout for other, miscellaneous operations.

However, when I changed that value to like 1000000, then cassandra seemingly hung on startup -- but that could've just been the large timeout at work.
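For reference, this is what the edit described above would look like in cassandra.yaml (the 1000000 ms value is the experiment the question describes, not a recommendation -- see the answer below for why raising it is a bad idea):

```yaml
# cassandra.yaml -- default is 10000 (10 seconds)
request_timeout_in_ms: 1000000
```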

My goal for data generation is 2TB. How do I query that large of space without running into timeouts?

Queries:

SELECT  huntpilotdn 
FROM    project.t1 
WHERE   (currentroutingreason, orignodeid, origspan,  
        origvideocap_bandwidth, datetimeorigination)
        > (1,1,1,1,1)
AND      (currentroutingreason, orignodeid, origspan,    
         origvideocap_bandwidth, datetimeorigination)
         < (1000,1000,1000,1000,1000)
LIMIT 10000
ALLOW FILTERING;

SELECT  destcause_location, destipaddr
FROM    project.t2
WHERE   datetimeorigination = 110
AND     num >= 11612484378506
AND     num <= 45880092667983
LIMIT 10000;


SELECT  origdevicename, duration
FROM    project.t3
WHERE   destdevicename IN ('a','f', 'g')
LIMIT 10000
ALLOW FILTERING;

I have a demo keyspace with the same schemas, but a far smaller data size (~10GB), and these queries run just fine in that keyspace.

All these tables that are queried have millions of rows and around 30 columns in each row.

Answer

I'm going to guess that you are also using secondary indexes. You are finding out firsthand why secondary index queries and ALLOW FILTERING queries are not recommended...because those types of design patterns do not scale for large datasets. Rebuild your model with query tables that support primary key lookups, as that is how Cassandra is designed to work.
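As an illustration of the query-table approach (the table and column types here are hypothetical, since the question doesn't show the full schema), the third query above could be served by a table that makes the filtered column the partition key:

```cql
-- Hypothetical query table: duplicates t3's data, keyed for this one lookup.
-- destdevicename becomes the partition key, so each lookup reads one partition.
CREATE TABLE project.t3_by_destdevicename (
    destdevicename text,
    origdevicename text,
    duration       int,
    PRIMARY KEY ((destdevicename), origdevicename)
);

-- The lookup now targets single partitions -- no ALLOW FILTERING needed:
SELECT origdevicename, duration
FROM   project.t3_by_destdevicename
WHERE  destdevicename IN ('a', 'f', 'g')
LIMIT  10000;
```

The trade-off is denormalization: the application writes the same row to every query table that needs it, which is the expected pattern in Cassandra.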

Edit

"The variables that are constrained are cluster keys."

Right...which means they are not partition keys. Without constraining your partition key(s) you are basically scanning your entire table, as clustering keys are only valid (cluster data) within their partition key.
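A sketch of that distinction, using a hypothetical cut-down definition of t1 (the question doesn't show the actual PRIMARY KEY):

```cql
-- Hypothetical: datetimeorigination as the partition key,
-- currentroutingreason and orignodeid as clustering keys.
CREATE TABLE project.t1 (
    datetimeorigination  int,
    currentroutingreason int,
    orignodeid           int,
    huntpilotdn          text,
    PRIMARY KEY ((datetimeorigination), currentroutingreason, orignodeid)
);

-- Efficient: the partition key is constrained by equality, so the range
-- on the clustering key is resolved within a single partition.
SELECT huntpilotdn FROM project.t1
WHERE  datetimeorigination = 110
AND    currentroutingreason > 1 AND currentroutingreason < 1000;

-- Full scan: only clustering keys constrained, so every partition on every
-- node must be read -- which is why CQL refuses it without ALLOW FILTERING.
-- SELECT huntpilotdn FROM project.t1
-- WHERE  currentroutingreason > 1 AND currentroutingreason < 1000
-- ALLOW FILTERING;
```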

Edit 20190731

So while I may have the "accepted" answer, I can see that there are three additional answers here. They all focus on changing the query timeout, and two of them outscore my answer (one by quite a bit).

As this question continues to rack up page views, I feel compelled to address the aspect of increasing the timeout. Now, I'm not about to downvote anyone's answers, as that would look like "sour grapes" from a vote perspective. But I can articulate why I don't feel that solves anything.

First, the fact that the query times out at all is a symptom; it's not the main problem. Therefore increasing the query timeout is simply a bandaid solution, obscuring the main problem.

The main problem, of course, is that the OP is trying to force the cluster to support a query that does not match the underlying data model. As long as this problem is ignored and worked around (instead of being dealt with directly), it will continue to manifest itself.

Secondly, look at what the OP is actually trying to do:

"My goal for data generation is 2TB. How do I query that large of space without running into timeouts?"

Those query timeout limits are there to protect your cluster. If you were to run a full-table scan (which means a full-cluster scan to Cassandra) through 2TB of data, that timeout threshold would need to be quite large. In fact, if you did manage to find the right number to allow that, your coordinator node would tip over LONG before most of the data was assembled in the result set.

In summary, increasing query timeouts:

  1. Gives the appearance of "helping" by forcing Cassandra to work against how it was designed.

  2. Can potentially crash a node, putting the stability of the underlying cluster at risk.

Therefore, increasing the query timeouts is a terrible, TERRIBLE IDEA.

