执行cassandra表的范围查询 [英] Performing range queries for cassandra table

查看:1487
本文介绍了执行cassandra表的范围查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我尝试使用以下模式存储数据:

  CREATE TABLE temp_humidity_data(
asset_id text,
date text,
event_time timestamp,
temprature int,
humidity int,
PRIMARY KEY((asset_id,date),event_time)

我已遵循datastax文章时间序列建模入门 -

org / blog / post / getting-started-with-time-series-data-modeling /rel =nofollow> http://planetcassandra.org/blog/post/getting-started-with-time-series-data -modeling /



然而对于这个数据模型,一个不起作用的事情是查询,它返回两个日期之间的数据。



如果我这样做:

  select * from temp_humidity_data 
其中asset_id ='1234'AND date> ='2010-04-02'AND date< ='2011-04-03';

它给我以下错误:


code = 2200 [无效的查询] message =只有EQ和IN关系是
支持分区键(除非你使用token()函数)


在理解有一种方法可以做IN运算符,但我不想把所有这些日期在一个'IN'运算符。在两个日期之间使用上面的表定义数据时,有没有办法查询?

解决方案

主键中的第一个键一个复合在你的情况下)负责散布在不同的分区的数据。每个分区整体保持在单个节点(及其副本)上。即使您请求的查询是可能的(如果您只有日期作为主要和使用byteorderedpartitioner - 默认是murmur3),它将有效地对整个群集进行全面扫描。考虑这与在没有索引的列上的rdbms中的全表扫描类似,只是现在全表扫描跨越多个机器。



复合的目标分区键这里是确保没有分区变得难以控制大。它也会导致您无法跨日期进行范围查询。如果您认为资产的数据适合单个分区,则可以将日期设置为第一个集群键。这将启用查询。但是,资产的所有行都位于单个分区上。这可能是一个问题(通常很好的目标大约100MB作为最大分区大小 - 虽然有例外),并且热点可能会出现在您的群集(节点持有分区,非常忙的东西将忙,而其他节点不那么) 。另一种方法是维护手动桶 - 添加一个bucketid int作为分区键的一部分[即(asset_id,bucket_id)],将日期作为第一个集群密钥,并维护应用程序代码。这将为您控制的多个分区分配资产的数据。这需要一些计算,并需要您自己查询每个存储桶,但会阻止热点,并允许您的日期范围查询。如果特定资源的数据超出单个分区大小,但通过存储桶可管理,则显然只能这样做。



如果您绝对必须根据日期进行分区,像火花和鲨鱼这样的高效后期聚合。



希望有帮助。


I am trying to store data with following schema:

 CREATE TABLE temp_humidity_data (
                asset_id text, 
                date text, 
                event_time timestamp, 
                temprature int, 
                humidity int,
                PRIMARY KEY((asset_id, date),event_time)
            )

I have followed datastax article 'Getting Started with Time Series Modeling' - http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/

however with this data model one thing that is not working is query that returns me data between two dates. How do I do that?

If I do this:

select * from temp_humidity_data 
where asset_id='1234' AND date >= '2010-04-02' AND date <= '2011-04-03';

It gives me following error:

code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"

In understand there is a way to do IN operator but I don't want to put all those dates in a 'IN' operator. Is there a way to query when using the above table definition data between two dates?

解决方案

The first key in the primary key (a composite in your case) is responsible for scattering the data across different partitions. Each partition is held in its entirety on a single node (and its replicas). Even if the query you request were possible (it would be if you had only the date as a primary and used a byteorderedpartitioner - the default is murmur3), it would effectively do a full scan across your cluster. Think of this being similar to a full table scan in an rdbms on a column without an index, only now the full table scan spans multiple machines.

The goal with the composite partition key here is to ensure no partition gets unmanageably big. It also takes away your ability to do the range query across dates. If you think your data for an asset can fit in a single partition, you can make the date the first clustering key. That would enable the query. However, all rows for an asset would be on a single partition. This may be an issue (it's typically good to target around 100MB as a max partition size - though there are exceptions), and hotspots may arise in your cluster (nodes holding partitions for very busy stuff will be busy, while other nodes less so). Another way around this is to maintain manual buckets - add a bucketid int as part of the partition key [i.e. (asset_id, bucket_id)], have date as the first clustering key, and maintain the bucketing from application code. This would distribute the data for an asset across multiple partitions that you control. This will need a bit of calculation, and will need you to query each bucket yourself - but will prevent hotspots, and allow your date range queries. You'd obviously only do this if the data for a particular asset is beyond single partition size, but manageable via buckets.

If you absolutely must partition based on date, consider things like Spark and Shark to do efficient post aggregation.

Hope that helps.

这篇关于执行cassandra表的范围查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆