Storing time ranges in Cassandra

Problem description

I'm looking for a good way to store data associated with a time range, in order to be able to efficiently retrieve it later.

Each entry of data can be simplified as (start time, end time, value). I will need to later retrieve all the entries which fall inside a (x, y) range. In SQL, the query would be something like

SELECT value FROM data WHERE starttime <= x AND endtime >= y

Can you suggest a structure for the data in Cassandra which would allow me to perform such queries efficiently?

Solution

This is an oddly difficult thing to model efficiently.

I think using Cassandra's secondary indexes (along with a dummy indexed value, which is unfortunately still needed at the moment) is your best option. You'll need to use one row per event with at least three columns: 'start', 'end', and 'dummy'. Create a secondary index on each of these. The first two can be LongType and the last can be BytesType. See this post on using secondary indexes for more details. Since you have to use an EQ expression on at least one column for a secondary index query (the unfortunate requirement I mentioned), the EQ will be on 'dummy', which can always be set to 0. (This means that the EQ index expression will match every row and essentially be a no-op.) You can store the rest of the event data in the row alongside start, end, and dummy.
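
For concreteness, a minimal sketch of that schema setup with pycassa's SystemManager might look like the following; the keyspace and column family names ('Keyspace1' and 'entries') are placeholders rather than anything from the original answer, and the keyspace is assumed to already exist:

from pycassa.system_manager import SystemManager, LONG_TYPE, BYTES_TYPE

# placeholder names: keyspace 'Keyspace1', column family 'entries'
sys_mgr = SystemManager('localhost:9160')

# one row per event, holding the 'start', 'end', and 'dummy' columns
sys_mgr.create_column_family('Keyspace1', 'entries')

# a secondary index on each of the three columns
sys_mgr.create_index('Keyspace1', 'entries', 'start', LONG_TYPE)
sys_mgr.create_index('Keyspace1', 'entries', 'end', LONG_TYPE)
sys_mgr.create_index('Keyspace1', 'entries', 'dummy', BYTES_TYPE)

sys_mgr.close()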

In pycassa (https://github.com/pycassa/pycassa), a Python Cassandra client, your query would look like this:

from pycassa.index import *
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

# 'Keyspace1' and 'entries' are placeholder keyspace/column family names
pool = ConnectionPool('Keyspace1')
entries = ColumnFamily(pool, 'entries')

start_time = 12312312000
end_time = 12312312300
start_exp = create_index_expression('start', start_time, GT)
end_exp = create_index_expression('end', end_time, LT)
dummy_exp = create_index_expression('dummy', 0, EQ)
clause = create_index_clause([start_exp, end_exp, dummy_exp], count=1000)
for result in entries.get_indexed_slices(clause):
    # do stuff with result
    pass

There should be something similar in other clients.

The alternative that I considered first involved OrderPreservingPartitioner, which is almost always a Bad Thing. For the index, you would use the start time as the row key and the finish time as the column name. You could then perform a range slice with start_key=start_time and column_finish=finish_time. This would scan every row after the start time and only return those with columns before the finish_time. Not very efficient, and you have to do a big multiget, etc. The built-in secondary index approach is better because nodes will only index local data and most of the boilerplate indexing code is handled for you.
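
For illustration only, a rough sketch of that alternative with pycassa might look like this; the 'TimeIndex' column family (assumed to be created with LongType row keys and a LongType comparator) and the keyspace name are assumptions, and the layout only makes sense under OrderPreservingPartitioner:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

# placeholders: keyspace 'Keyspace1', column family 'TimeIndex' created
# with LongType row keys and a LongType comparator, so integer times can
# be used directly as row keys and column names
pool = ConnectionPool('Keyspace1')
time_index = ColumnFamily(pool, 'TimeIndex')

start_time = 12312312000
end_time = 12312312300

# row key = event start time, column name = event finish time,
# column value = event id; scan rows from start_time onward and keep
# only the columns whose finish time falls before end_time
for start_key, columns in time_index.get_range(start=start_time,
                                               column_finish=end_time):
    for finish_time, event_id in columns.items():
        # follow up with a multiget on the main event rows
        pass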
