Cassandra - 重叠数据范围 [英] Cassandra - Overlapping Data Ranges
问题描述
我在Cassandra中有以下'任务'表。
I have the following 'Tasks' table in Cassandra.
- Task_ID UUID - 分区键
- Starts_On TIMESTAMP-群集列
- Ends_On TIMESTAMP - 群集列
我想运行CQL查询以获取指定日期范围内的重叠任务。例如,如果我将两个时间戳(T1和T2)作为参数传递给查询,我想获得适用于该范围内的所有任务(即重叠记录)。
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get the all tasks that are applicable with in that range (that is, overlapping records).
在Cassandra中最好的方法是什么?我不能在这里使用Starts_On和Ends_On两个范围,因为要向Ends_On添加一个范围查询,我必须有一个相等的检查Starts_On。
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here because to add a range query to Ends_On, I have to have a equality check for Starts_On.
推荐答案
这里有另一个想法(有点不常见)。您可以创建一个用户定义的函数来实现第二个范围过滤器(在Cassandra 2.2和更新版本中)。
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
假设你像这样定义你的表保持示例简单):
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
现在我们创建一个用户定义的函数来检查基于结束时间返回的行,并返回task_id匹配行,像这样:
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
现在我在使用第三个参数。在一个明显的(主要的)监督中,似乎你不能传递一个常量到用户定义的函数。所以为了解决这个问题,我们传递一个静态列(end_range)作为我们的常量。
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
所以首先我们要设置end_range:
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
假设我们有这些数据:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
现在让我们得到task_id的开始> = 2, end <= 15:
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
这样就给你匹配的task_id,空行(我还没有想出一个方法来删除行使用UDF的)。你会注意到,开始> = 2的过滤器在传递给UDF之前已经删除了一行。
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
无论如何不是一个完美的方法,可以工作。 :)
Anyway not a perfect method obviously, but it might be something you can work with. :)
这篇关于Cassandra - 重叠数据范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!