Most efficient way to query Cassandra in small time-based chunks

Question

My Cassandra-based application needs to read the rows that have changed since the last read. For this purpose, we are planning a table changed_rows with two columns:

  1. ID: the ID of the changed row, and
  2. Updated_Time: the timestamp at which it was changed.

What is the best way to read such a table so that each read returns a small group of rows ordered by time? For example, if the table is:

ID   Updated_Time
foo    1000
bar    1200
abc    2000
pqr    2500
zyx    2900
 ...
xyz   901000
 ...

I have shown the IDs as simple 3-letter keys; in reality they are UUIDs. Also, the times above are shown as integers for simplicity, but they are actual Cassandra timestamps (or Java Dates). The Updated_Time column is monotonically increasing.

If I query this data with:

SELECT * FROM changed_rows WHERE Updated_Time < toTimestamp(now())

I get the following error:

Cannot execute this query as it might involve data filtering and 
thus may have unpredictable performance... Use Allow Filtering

But I think ALLOW FILTERING would kill performance in this case. The Cassandra index page warns against indexing high-cardinality columns, and Updated_Time above certainly looks like a high-cardinality column.

I do not know the ID values beforehand, because the purpose of the query is to find out which IDs were updated within a given time interval.

What is the best way to query Cassandra in this case then?
Can I change my table somehow to run the time-chunk query more efficiently?

Note: This should sound somewhat similar to the Cassandra CDC feature, but we cannot use it because our solution must work with all Cassandra versions.

Recommended answer

Assuming you know the time intervals you want to query, you need to create another table like the following:

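CREATE TABLE modified_records (
    timeslot timestamp,      -- partition key: start of the time slice
    updatedtime timestamp,   -- clustering column: rows are sorted by it within a slice
    recordid timeuuid,
    -- recordid is part of the key so that two records updated at the
    -- same instant do not overwrite each other
    PRIMARY KEY (timeslot, updatedtime, recordid)
);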

Now you can split your "updated record log" into time slices, e.g. 1 hour each, and fill the table like this:

-- 123, 456 and 789 are placeholders for record IDs; a timeuuid column
-- only accepts real timeuuid values.
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 09:00:00', '2017-02-27 09:36:00', 123);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 09:00:00', '2017-02-27 09:56:00', 456);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 10:00:00', '2017-02-27 10:00:13', 789);

Here a part of the updatedtime timestamp serves as the partition key; in this case it is rounded down to the whole hour. You then query by specifying the time slot only, e.g.:

SELECT * FROM modified_records WHERE timeslot = '2017-02-27 09:00:00';
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 10:00:00';
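As a rough illustration of the rounding step, the timeslot value could be computed on the client before each insert. The sketch below is in Python and uses only the standard library; the function name `timeslot_for` and the default one-hour slot are assumptions made for this example, not part of the original answer.

```python
from datetime import datetime, timedelta


def timeslot_for(updated: datetime, slot: timedelta = timedelta(hours=1)) -> datetime:
    """Round `updated` down to the start of the time slot that contains it.

    Hypothetical helper: slots are counted from 1970-01-01 00:00:00 in the
    same timezone (or lack of one) as `updated`.
    """
    epoch = datetime(1970, 1, 1, tzinfo=updated.tzinfo)
    # timedelta // timedelta yields the whole number of slots elapsed.
    slots_elapsed = (updated - epoch) // slot
    return epoch + slots_elapsed * slot


if __name__ == "__main__":
    # The three updates from the INSERT statements above.
    for ts in (
        datetime(2017, 2, 27, 9, 36, 0),
        datetime(2017, 2, 27, 9, 56, 0),
        datetime(2017, 2, 27, 10, 0, 13),
    ):
        print(ts, "->", timeslot_for(ts))
```

The writer and the reader must agree on the slot length and timezone, since the result is used as the partition key on both sides.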

Depending on how often your records are updated, you can use smaller or larger time slices, e.g. every 15 minutes, every 6 hours, or every day. This structure is very flexible: you only need to know which timeslot to query. If a query spans multiple timeslots, you need to run one query per timeslot.
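To read everything changed since the last read, one possible approach is to list every timeslot that overlaps the interval and issue one query per slot, narrowing each query with a range on the updatedtime clustering column. The Python sketch below only builds the list of queries and their parameters. The helper names and the `%s` placeholder style are assumptions for illustration; adapt them to whichever driver you use.

```python
from datetime import datetime, timedelta

# Range restriction on the clustering column inside a single partition.
QUERY = (
    "SELECT * FROM modified_records "
    "WHERE timeslot = %s AND updatedtime > %s AND updatedtime <= %s"
)


def timeslots_between(start: datetime, end: datetime,
                      slot: timedelta = timedelta(hours=1)) -> list:
    """Return the start of every time slot that overlaps [start, end]."""
    epoch = datetime(1970, 1, 1, tzinfo=start.tzinfo)
    # Round `start` down to the beginning of its slot.
    current = epoch + ((start - epoch) // slot) * slot
    slots = []
    while current <= end:
        slots.append(current)
        current += slot
    return slots


def queries_for_range(last_read: datetime, now: datetime,
                      slot: timedelta = timedelta(hours=1)) -> list:
    """Build one (query, parameters) pair per timeslot, oldest slot first."""
    return [
        (QUERY, (ts, last_read, now))
        for ts in timeslots_between(last_read, now, slot)
    ]


if __name__ == "__main__":
    last_read = datetime(2017, 2, 27, 9, 36, 0)
    now = datetime(2017, 2, 27, 11, 5, 0)
    for query, params in queries_for_range(last_read, now):
        print(query, params)
```

Running the queries in slot order keeps the combined result ordered by time, because each partition is already sorted by updatedtime.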
