How to construct a range query in Cassandra?


Question

CREATE TABLE users ( 
userID uuid, 
firstname text, 
lastname text, 
state text, 
zip int,
age int,
PRIMARY KEY (userID) 
);

I want to construct the following queries:

select * from users where age between 30 and 40

select * from users where state in "AZ" AND "WA"

I know I need two more tables to run these queries, but I don't know what they should look like.

Edit

From Carlo's comments, I see this is the only possibility:

CREATE TABLE users ( 
userID uuid, 
firstname text, 
lastname text, 
state text, 
zip int,
age int,
PRIMARY KEY (age,zip,userID) 
);

Now, to select users with an age between 15 and 30, this is the only possibility:

select * from users where age IN (15,16,17,....30)

However, using the IN operator here is not recommended and is an anti-pattern.

How about creating a secondary index on age?

CREATE index users_age ON users(age)

Would that help?

Thanks

Recommended answer

Range queries are a prickly question. The way to perform a real range query is to use a compound primary key, with the range on the clustering part. Since the range is on the clustering part, you can't perform the queries you wrote: you need at least an equality condition on the whole partition key. Let's see an example:

CREATE TABLE users (
  mainland text,
  state text,
  uid int,
  name text,
  zip int,
  PRIMARY KEY ((mainland), state, uid)
) 

The uid is now an int just to make tests easier.

insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'washington', 1, 'john', 98100);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'texas', 2, 'lukas', 75000);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 3, 'henry', 19904);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 19910);
insert into users (mainland, state, uid, name, zip) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 20150);
insert into users (mainland, state, uid, name, zip) VALUES ( 'southamerica', 'argentina', 6, 'alex', 10840);

Now the query can do what you need:

 select * from users where mainland = 'northamerica' and state > 'ca' and state < 'ny';

Output

 mainland    | state    | uid | name   | zip
-------------+----------+-----+--------+-------
northamerica | delaware |   3 |  henry | 19904
northamerica | delaware |   4 | dawson | 19910
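
To make the rule "equality on the partition key, range on the clustering key" more concrete, here is a toy in-memory model in Python. It is only a sketch and not how Cassandra is implemented: the `partitions` dictionary and the `insert` and `range_query` helpers are hypothetical names introduced for illustration. Each partition keeps its rows sorted by the clustering columns, so a range on `state` becomes a contiguous slice of a single partition.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Toy model, not Cassandra itself: each partition key maps to a list of rows
# kept sorted by the clustering columns (state, uid).
partitions = defaultdict(list)


def insert(mainland, state, uid, name, zip_code):
    """Add a row to its partition and keep the partition in clustering order."""
    rows = partitions[mainland]
    rows.append((state, uid, name, zip_code))
    rows.sort()  # tuples sort by state first, then uid


def range_query(mainland, state_gt, state_lt):
    """Return rows of ONE partition with state_gt < state < state_lt."""
    rows = partitions.get(mainland, [])
    states = [row[0] for row in rows]
    lo = bisect_right(states, state_gt)  # index of first row with state > state_gt
    hi = bisect_left(states, state_lt)   # index of first row with state >= state_lt
    return rows[lo:hi]


insert("northamerica", "washington", 1, "john", 98100)
insert("northamerica", "texas", 2, "lukas", 75000)
insert("northamerica", "delaware", 3, "henry", 19904)
insert("northamerica", "delaware", 4, "dawson", 19910)
insert("centraleurope", "italy", 5, "fabio", 20150)
insert("southamerica", "argentina", 6, "alex", 10840)

result = range_query("northamerica", "ca", "ny")
for row in result:
    print(row)
```

The slice only works because the partition is known up front. Without an equality condition on `mainland`, every partition would have to be scanned, which is the reason the queries in the question are rejected.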

If you put an int (age, zip code) as the first column of the clustering key, you can perform the same kind of query comparing integers.

TAKE CARE: most people, when looking at this situation, start thinking "OK, I can use a fake partition key that is always the same, and then I can perform range queries". This is a huge mistake: the partition key is responsible for data distribution across nodes. Setting a fixed partition key means that all data will end up on the same node (and on its replicas).

Dividing the world into 15 to 20 zones (in order to have 15 to 20 partition keys) is something, but it is not enough; it is done here just to create a valid example.
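
The effect of the partition key on distribution can be illustrated with a small Python simulation. This is a rough sketch under loud assumptions: `node_for` and `rows_per_node` are hypothetical helpers, and hashing with MD5 modulo the node count is only a stand-in for a real partitioner and token ring, with replication ignored entirely.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    """Toy stand-in for a partitioner: hash the key, take it modulo the node count."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    """Count how many rows each node would own for the given partition keys."""
    return Counter(node_for(key, node_count) for key in partition_keys)


# 1000 rows on a hypothetical 6-node cluster, with three partitioning choices.
fixed = rows_per_node(["all_users"] * 1000, node_count=6)
zones = rows_per_node([f"zone_{i % 20}" for i in range(1000)], node_count=6)
unique = rows_per_node([f"user_{i}" for i in range(1000)], node_count=6)

print("fixed key:  ", dict(fixed))
print("20 zones:   ", dict(zones))
print("unique keys:", dict(unique))
```

With the fixed key every row maps to the same node. With 20 zone keys the rows spread out, but only in 20 indivisible chunks, so the balance is coarse compared with one key per user.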



EDIT: due to the question's edit

I did not say that this is the only possibility; if you can't find a valid way to partition your users and need to perform this kind of query, this is one possibility, not the only one. Range queries should be performed on the clustering key portion. A weak point of AGE as the partition key is that you can't perform an UPDATE on it: any time you need to update a user's age, you have to perform a delete and an insert. (An alternative could be storing birth_year/birth_date instead of the age, and then calculating the age client side.)
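
The client-side calculation mentioned above could look like the following Python sketch. It assumes the table stores a full `birth_date`; the function name `age_on` is hypothetical and chosen only for this example.

```python
from datetime import date


def age_on(birth_date, today):
    """Return the age in whole years on the given day."""
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


print(age_on(date(1990, 6, 15), date(2020, 6, 14)))  # day before the 30th birthday
print(age_on(date(1990, 6, 15), date(2020, 6, 15)))  # on the 30th birthday
```

Because the stored value never changes, no delete and insert is needed when a user gets older; only the computed age changes.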

To answer your question on adding a secondary index: queries on a secondary index do not actually support the IN operator. From the CQL message it looks like they're going to develop it soon:


Bad Request: IN predicates on non-primary-key columns (xxx) is not yet supported

However, even if secondary indexes supported the IN operator, your query wouldn't change from

select * from users where age IN (15,16,17,....30)

Just to clarify my point: anything that does not have a "clean" and "ready" solution requires effort from the user to model the data in a way that satisfies their needs. To give an example (I'm not saying this is a good solution: I would not use it):

CREATE TABLE users (
  years_range text,
  age int,
  uid int,
  PRIMARY KEY ((years_range), age, uid)
)

Insert some data

insert into users (years_range, age , uid) VALUES ( '11_15', 14, 1);
insert into users (years_range, age , uid) VALUES ( '26_30', 28, 3);
insert into users (years_range, age , uid) VALUES ( '16_20', 16, 2);
insert into users (years_range, age , uid) VALUES ( '26_30', 29, 4);
insert into users (years_range, age , uid) VALUES ( '41_45', 41, 5);
insert into users (years_range, age , uid) VALUES ( '21_25', 23, 5);

Query the data

select * from users where years_range in('11_15', '16_20', '21_25', '26_30') and age > 14 and age < 29;

Output

 years_range | age | uid
-------------+-----+-----
       16_20 |  16 |   2
       21_25 |  23 |   5
       26_30 |  28 |   3

This solution might solve your problem and could be used in a small cluster, where about 20 keys (0_5 ... 106_110) might have a good distribution. But this solution, like the one before, does not allow an UPDATE and reduces the distribution of keys. The advantage is that you have small IN sets.
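
The small IN set for a given age range can be computed in the application. The Python sketch below assumes the bucket labels follow the pattern used in the inserts above (11_15, 16_20, 21_25, ...), with a width of 5 and buckets starting at 1; `years_range_buckets` is a hypothetical helper, and ages below 1 are not handled.

```python
def years_range_buckets(min_age, max_age, width=5):
    """Return the years_range labels covering min_age..max_age inclusive.

    Assumes buckets of the form 1_5, 6_10, 11_15, ... (an assumption taken
    from the labels in the example data).
    """
    if min_age < 1 or max_age < min_age:
        raise ValueError("expected 1 <= min_age <= max_age")
    # Start of the bucket that contains min_age.
    start = ((min_age - 1) // width) * width + 1
    labels = []
    while start <= max_age:
        labels.append(f"{start}_{start + width - 1}")
        start += width
    return labels


# "age > 14 and age < 29" means ages 15 through 28 inclusive.
buckets = years_range_buckets(15, 28)
print(buckets)
```

For ages 15 through 28 this yields the same four labels used in the IN clause of the query above; the `age > 14 and age < 29` condition then trims the first and last bucket to the exact range.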

In a perfect world where secondary indexes (S.I.) already allowed the IN clause, I'd use the UUID as the partition key, the years_range (set as birth_year_range) as S.I., and "filter" my data client side (if interested in 10 < age < 22, I would ask for IN('1991_1995', '1996_2000', '2001_2005', '2006_2010', '2011_2015'), calculating and removing the unneeded years in my application).

HTH,
Carlo

