Cassandra - 重叠数据范围 [英] Cassandra - Overlapping Data Ranges

查看:123
本文介绍了Cassandra - 重叠数据范围的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Cassandra中有以下'任务'表。

I have the following 'Tasks' table in Cassandra.


  • Task_ID UUID - 分区键

  • Starts_On TIMESTAMP-群集列

  • Ends_On TIMESTAMP - 群集列

我想运行CQL查询以获取指定日期范围内的重叠任务。例如,如果我将两个时间戳(T1和T2)作为参数传递给查询,我想获得适用于该范围内的所有任务(即重叠记录)。

I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get the all tasks that are applicable with in that range (that is, overlapping records).

在Cassandra中最好的方法是什么?我不能在这里使用Starts_On和Ends_On两个范围,因为要向Ends_On添加一个范围查询,我必须有一个相等的检查Starts_On。

What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here because to add a range query to Ends_On, I have to have a equality check for Starts_On.

推荐答案

这里有另一个想法(有点不常见)。您可以创建一个用户定义的函数来实现第二个范围过滤器(在Cassandra 2.2和更新版本中)。

Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).

假设你像这样定义你的表保持示例简单):

Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):

CREATE TABLE tasks (
    p int, 
    task_id timeuuid, 
    start int, 
    end int, 
    end_range int static, 
    PRIMARY KEY(p, start));



现在我们创建一个用户定义的函数来检查基于结束时间返回的行,并返回task_id匹配行,像这样:

Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:

CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int) 
    CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS 
    'if (end <= end_range) return task_id; else return null;';

现在我在使用第三个参数。在一个明显的(主要的)监督中,似乎你不能传递一个常量到用户定义的函数。所以为了解决这个问题,我们传递一个静态列(end_range)作为我们的常量。

Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.

所以首先我们要设置end_range:

So first we have to set the end_range we want:

UPDATE tasks SET end_range=15 where p=1;

假设我们有这些数据:

SELECT * FROM tasks;

 p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
 1 |     1 |        15 |   5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
 1 |     2 |        15 |   7 | 3233a040-4a88-11e5-a180-433e07a8bafb
 1 |     4 |        15 |  22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
 1 |     8 |        15 |  15 | 37ec7840-4a88-11e5-a180-433e07a8bafb

现在让我们得到task_id的开始> = 2, end <= 15:

Now let's get the task_id's that have start >= 2 and end <= 15:

SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks 
    WHERE p=1 AND start >= 2;

 start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
     2 |   7 |       3233a040-4a88-11e5-a180-433e07a8bafb
     4 |  22 |                                       null
     8 |  15 |       37ec7840-4a88-11e5-a180-433e07a8bafb

这样就给你匹配的task_id,空行(我还没有想出一个方法来删除行使用UDF的)。你会注意到,开始> = 2的过滤器在传递给UDF之前已经删除了一行。

So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.

无论如何不是一个完美的方法,可以工作。 :)

Anyway not a perfect method obviously, but it might be something you can work with. :)

这篇关于Cassandra - 重叠数据范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆